Video representation learning and its applications

A talk by Mark Hsiao
Principal AI Architect, OPPO


About this talk

In recent years, most of the accuracy gains for video recognition (such as video action classification) have come from newly designed CNN architectures (e.g., 3D-CNNs). These models are trained by applying a deep CNN to a single clip of fixed temporal length. Since each video segment is processed by the 3D-CNN module separately, the corresponding clip descriptor is local and the inter-clip relationships remain implicit. The common practice of directly averaging the clip-level outputs into a video-level prediction is prone to fail because it lacks a mechanism to extract and integrate the relevant information that represents the whole video.
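The averaging baseline described above can be sketched in a few lines. This is an illustrative example, not the talk's implementation; the shapes, class count, and random logits are assumptions.

```python
import numpy as np

# Hypothetical setup: 8 fixed-length clips from one video, each already
# passed through a 3D-CNN that outputs logits over 5 action classes.
rng = np.random.default_rng(0)
num_clips, num_classes = 8, 5
clip_logits = rng.normal(size=(num_clips, num_classes))

# The baseline: average the clip-level outputs to get a video-level
# prediction. Each clip is scored independently, so no inter-clip
# relationships are modeled.
video_logits = clip_logits.mean(axis=0)
predicted_class = int(np.argmax(video_logits))
```

Because every clip contributes equally, a few irrelevant or ambiguous clips can dilute the signal from the clips that actually contain the action, which is the failure mode the talk addresses.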

In this talk, we will introduce a novel neural fusion network that learns a better video-level representation and greatly boosts existing video classifiers at the cost of a tiny computation overhead. It explicitly models the inter-dependencies between video clips to strengthen the receptive field of the local clip descriptors. Experiments on a large-scale benchmark dataset show a significant improvement of the proposed method compared to existing approaches. Based on this new video representation, we also showcase several potential applications in the video domain.

