When
3 September, 12:30 - 14:00
What
The rapid evolution of vision-language models is transforming the landscape of image and video understanding, going beyond traditional classification and localization paradigms. In this seminar you will explore two recent methodologies that challenge the conventional reliance on predefined vocabularies and training data.
The first part of the talk introduces Vocabulary-free Image Classification (VIC), a novel task that assigns classes to images without the constraint of a fixed vocabulary. You will delve into the challenges of operating in an unconstrained semantic space containing millions of concepts, and learn about Category Search from External Databases (CaSED), a training-free method that leverages external vision-language databases for efficient and accurate classification.

In the second part, we will shift focus to Test-Time Zero-Shot Temporal Action Localization (ZS-TAL), which tackles the problem of identifying and locating unseen actions in untrimmed videos without annotated training data. The seminar will introduce Test-Time adaptation for Temporal Action Localization (T3AL), an approach that adapts a pre-trained Vision and Language Model (VLM) to perform action localization in a self-supervised manner, significantly improving generalization across diverse video domains.

Finally, you will see how LLMs can act as orchestrators that solve research problems autonomously through visual programming.
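To give a flavor of the retrieve-then-score idea behind CaSED, here is a minimal sketch in Python. The toy 2-D vectors stand in for real vision-language (e.g. CLIP-style) embeddings, and the helper names (`cased_classify`, `candidate_extractor`, `text_embedder`) are illustrative assumptions, not the authors' actual API.

```python
import numpy as np

def cased_classify(image_emb, caption_embs, caption_texts,
                   candidate_extractor, text_embedder, k=3):
    """Training-free retrieve-then-score classification (CaSED-style sketch)."""
    # 1) Retrieve the k captions most similar to the image embedding.
    sims = caption_embs @ image_emb
    top = np.argsort(-sims)[:k]
    # 2) Extract candidate category names from the retrieved captions.
    candidates = sorted({c for i in top
                         for c in candidate_extractor(caption_texts[i])})
    # 3) Score each candidate name against the image and pick the best match.
    scores = {c: float(text_embedder(c) @ image_emb) for c in candidates}
    return max(scores, key=scores.get)

# Toy external database: 2-D vectors stand in for real caption embeddings.
captions = ["a photo of a dog on grass", "a cat sleeping on a sofa"]
caption_embs = np.array([[0.9, 0.1], [0.1, 0.9]])
text_embs = {"dog": np.array([1.0, 0.0]), "cat": np.array([0.0, 1.0])}

label = cased_classify(
    image_emb=np.array([1.0, 0.0]),  # a "dog-like" image embedding
    caption_embs=caption_embs,
    caption_texts=captions,
    candidate_extractor=lambda t: [w for w in t.split() if w in text_embs],
    text_embedder=text_embs.__getitem__,
    k=2,
)
print(label)  # → dog
```

Because all knowledge lives in the external database and a frozen embedding model, no task-specific training is needed; the candidate vocabulary is built on the fly from the retrieved captions.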