TRAINING SCHOOL - FACULTY / Courses




The school faculty is presently being finalised.

Germany (University of Stuttgart, Constructor University) | Italy (University of Trento) | Malta (University of Malta) | Netherlands (Utrecht University) | Portugal (University of Lisbon) | Sweden (Örebro University, Chalmers University of Technology) | Turkey (Hacettepe University, Koç University)


Visuospatial Commonsense:
On Neurosymbolic Reasoning and Learning about Space and Motion

Prof. Mehul Bhatt
Örebro University, SWEDEN

This talk addresses computational cognitive vision and perception at the interface of (spatial) language, (spatial) logic, (spatial) cognition, and artificial intelligence. Summarising recent works, I present general methods for the semantic interpretation of dynamic visuospatial imagery with an emphasis on the ability to (neurosymbolically) perform abstraction, reasoning, and learning with cognitively rooted structured characterisations of commonsense knowledge pertaining to space and motion. I will particularly highlight:

The presented works, demonstrated against the backdrop of applications in autonomous driving, cognitive robotics, visuoauditory media, and cognitive psychology, are intended to serve as a systematic model and general methodology integrating diverse, multi-faceted AI methods pertaining to Knowledge Representation and Reasoning, Computer Vision, and Machine Learning towards realising practical, human-centred computational visual intelligence. I will conclude by highlighting a bottom-up interdisciplinary approach, at the confluence of Cognition, AI, Interaction, and Design Science, that is necessary to better appreciate the complexity and spectrum of varied human-centred challenges in the design and (usable) implementation of (explainable) artificial visual intelligence solutions in diverse human-system interaction contexts.
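As a concrete illustration of the kind of structured, commonsense characterisation of space and motion the talk builds on, here is a minimal sketch in Python; the relation vocabulary, the Box representation, and the thresholds are illustrative assumptions, not the lecture's actual formalism.

```python
# A minimal, illustrative sketch of qualitative spatial abstraction over
# bounding boxes, in the spirit of relational space-and-motion reasoning.
# Relation names, the Box representation, and thresholds are assumptions.
from dataclasses import dataclass

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

def topology(a: Box, b: Box) -> str:
    """Map metric boxes to a coarse, RCC-style qualitative relation."""
    if a.x2 < b.x1 or b.x2 < a.x1 or a.y2 < b.y1 or b.y2 < a.y1:
        return "disconnected"
    if a.x1 >= b.x1 and a.y1 >= b.y1 and a.x2 <= b.x2 and a.y2 <= b.y2:
        return "inside"
    if b.x1 >= a.x1 and b.y1 >= a.y1 and b.x2 <= a.x2 and b.y2 <= a.y2:
        return "contains"
    return "overlapping"

def motion(prev: Box, curr: Box, eps: float = 1.0) -> str:
    """Qualitative horizontal motion of a tracked object between frames."""
    dx = (curr.x1 + curr.x2 - prev.x1 - prev.x2) / 2
    if abs(dx) < eps:
        return "stationary"
    return "moving_right" if dx > 0 else "moving_left"

# Example: a pedestrian track relative to a parked car across two frames.
car = Box(40, 0, 80, 30)
pedestrian = [Box(0, 5, 10, 25), Box(35, 5, 45, 25)]
for t, p in enumerate(pedestrian):
    print(t, topology(p, car))      # disconnected, then overlapping
print(motion(pedestrian[0], pedestrian[1]))  # moving_right
```

Symbolic relations of this kind are what a neurosymbolic pipeline can abstract from (neural) detections and then reason or learn over.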



Multimodal Commonsense Reasoning

Prof. Aykut Erdem
Koç University, Istanbul, TURKEY

One of the long-standing goals of AI is to build computers that have a deep understanding of the world and can reason about its various aspects. Towards this end, deep learning-based techniques have made significant progress on many fronts in computer vision and natural language processing. More recently, there has been growing interest in moving beyond conventional text understanding or visual recognition tasks to explore the interaction between vision and language. Although we have seen considerable advances in the development of such multimodal models, these systems still largely lack commonsense intelligence. In this talk, I will first discuss research efforts over the past several years on assessing the multimodal commonsense reasoning abilities of deep models. Then, I will summarize our recent and ongoing contributions on designing benchmark datasets for understanding commonsense procedural knowledge and for intuitive reasoning about the physical interactions of objects.
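To make the benchmark-evaluation setting concrete, the following is a hedged sketch of how a multiple-choice multimodal commonsense benchmark is typically scored; the item fields (image_id, choices, answer_idx) and the toy baseline are assumptions for illustration, not any specific dataset's schema.

```python
# A hedged sketch of scoring a multiple-choice multimodal commonsense
# benchmark; item fields and the baseline are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    image_id: str
    question: str
    choices: list
    answer_idx: int  # index of the gold answer

def accuracy(items, predict):
    """predict(item) -> index of the model's chosen answer."""
    correct = sum(predict(it) == it.answer_idx for it in items)
    return correct / len(items)

# Toy items and a trivial baseline that always picks the first choice.
items = [
    BenchmarkItem("img_001", "What happens if the glass is pushed off the table?",
                  ["it breaks", "it floats", "it melts"], 0),
    BenchmarkItem("img_002", "Why is the person holding an umbrella?",
                  ["it is sunny", "it is raining", "it is windy"], 1),
]
print(accuracy(items, lambda it: 0))  # 0.5 on this toy set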





Multimodal Learning with Vision and Language

Prof. Erkut Erdem
Hacettepe University, TURKEY

Multimodal machine learning aims to develop methods capable of dealing with problems that require modelling and integrating multiple, mostly complementary, modalities. In the past few years, advances in deep learning have led to unified architectures that can efficiently process multimodal data, with many interesting applications in computer vision and natural language processing.
In this tutorial, we will provide a comprehensive review and analysis of methods for multimodal learning, focusing specifically on visual and textual data. We will discuss neural network architectures commonly used in computer vision and natural language processing, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs) and transformers, along with deep generative models such as variational auto-encoders (VAEs) and generative adversarial networks (GANs), and review recent advances in multimodal pre-training. We will then cover newly emerging tasks that combine vision and language, such as image captioning, visual question answering, visual reasoning, visual dialogue, image synthesis from text, and language-guided image manipulation.
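As a pointer to the kind of unified architecture the tutorial covers, here is a minimal, hedged sketch of dual-encoder contrastive pre-training on image-caption pairs (in the style popularised by CLIP); the linear "encoders", dimensions, and random feature tensors are stand-ins for real CNN/transformer stacks, assuming PyTorch is available.

```python
# A minimal sketch of dual-encoder contrastive pre-training on
# image-caption pairs; the encoders here are stand-in linear layers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDualEncoder(nn.Module):
    def __init__(self, img_dim=512, txt_dim=300, emb_dim=128):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)  # stands in for a CNN/ViT
        self.txt_proj = nn.Linear(txt_dim, emb_dim)  # stands in for a text transformer
        self.logit_scale = nn.Parameter(torch.tensor(2.0))

    def forward(self, img_feats, txt_feats):
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        # Similarity of every image to every caption in the batch.
        return self.logit_scale.exp() * img @ txt.t()

def contrastive_loss(logits):
    # Matching image-caption pairs sit on the diagonal.
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

model = TinyDualEncoder()
imgs, caps = torch.randn(8, 512), torch.randn(8, 300)
loss = contrastive_loss(model(imgs, caps))
print(loss.item())
```

The symmetric cross-entropy pulls matching image-caption pairs together and pushes mismatched pairs apart, which is the core of the cross-modal alignment discussed above.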



Vision and Language Models and the Symbol Grounding Problem

Prof. Albert Gatt
Utrecht University, NETHERLANDS

This lecture will focus on the classic problem of symbol grounding for models that deal with natural language. Broadly speaking, this problem concerns how symbolic information (e.g. words or phrases) connects with the world of perception and action. We will approach this topic from two angles. First, we consider approaches to the grounding problem from the perspective of cognitive science, paying special attention to the importance of grounding for theories and models of meaning. Then, we look at ongoing work in the field of Vision and Language research. At present, this field is dominated by deep, transformer-based models which are pre-trained on large datasets composed of pairings of images and descriptions. The goal of pre-training is to enable the models to establish cross-modal correspondences, so that linguistic (i.e. symbolic) data is “grounded” in visual features. After an overview of model architectures and pre-training approaches, we delve into recent work on analysing the grounding capabilities of such models using a variety of techniques, including foil-based methods and ablation. Models often turn out to have limited grounding capabilities, despite their good performance on downstream multimodal tasks such as Visual Question Answering.
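To fix ideas, here is a minimal sketch of foil-based probing; the score() function stands in for any pre-trained image-text matching model, and the example triples and the toy scorer are invented for illustration.

```python
# A hedged sketch of foil-based probing: the model must prefer the correct
# caption over a minimally-edited "foil". score() is a hypothetical stand-in
# for any pre-trained image-text matching model.
def foil_accuracy(examples, score):
    """examples: (image, caption, foil) triples; score(image, text) -> float."""
    hits = sum(score(img, cap) > score(img, foil)
               for img, cap, foil in examples)
    return hits / len(examples)

examples = [
    ("img_17.jpg", "a dog chasing a ball", "a dog chasing a cat"),
    ("img_42.jpg", "a cup on the table", "a cup under the table"),
]
# A toy scorer with no visual grounding at all: longer captions score higher.
ungrounded = lambda img, text: len(text)
print(foil_accuracy(examples, ungrounded))  # 0.5: right once, by accident
```

A genuinely grounded model should beat such ungrounded baselines by a wide margin; the lecture's point is that current models often do not by as much as their downstream scores would suggest.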



Responsible AI and Responsible Machine Translation: Challenges and Initiatives

Prof. Helena Gorete Silva Moniz
University of Lisbon, PORTUGAL

Responsible AI, Green AI, AI for Social Good, Fair AI, and Ethically Aligned Design are all terms under the generic umbrella usually referred to as Ethics and AI. The topic is highly disruptive, yet still fairly broad, and substantial concrete action is needed: political institutions and industry will be heavily impacted by the legislation and market rules of Responsible (in the sense of accountable) AI, both at present and in the near future. The impact of Responsible AI is not confined to technical, research, and industry applications; it also reaches the everyday actions of every citizen and affects all areas of society. In this lecture, we will describe the core concepts and main pillars of Responsible AI, emphasise the main initiatives that have been created, and discuss, based on concrete examples, the ethical concerns around several AI applications. We will then zoom in on Machine Translation (MT) and other Natural Language Processing (NLP) areas, to sharpen our contextualised view of Responsible AI and apply it to the scope of multimodal, multilingual and multicultural MT. As far as possible, the lecture will be an active learning experience, discussing concepts and broad views on Responsible AI and MT through practical examples and use cases. As a sneak peek of the possible topics, we will discuss, e.g., data origin, legal and copyright aspects; speech as a personally identifiable biomarker; vision and ethics; multimodal generation; end-user training on AI and awareness; metrics and reproducibility; model compression and Green AI; and societal implications. The lecture will also cover examples of research projects in the field and initiatives to create worldwide Centres for Responsible AI.



Detecting Relations in Images: Models, Applications and Challenges

Prof. Adrian Muscat
University of Malta, MALTA

Detecting relations between objects in images is a non-trivial problem in grounded language generation and understanding. This lecture, delivered from a computer science and engineering perspective, starts by defining and motivating the Visual Relation Detection sub-task required in Vision and Language tasks such as Visual Question Answering and Image Caption Generation, and in their subsequent use in robotics. Various models that explicitly or implicitly detect or generate relations are described, followed by an analysis of where the models fail. The lecture concludes with an in-depth discussion of potential solutions.
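As a simple illustration of one common modelling route, the sketch below predicts a spatial preposition for a (subject, object) pair from bounding-box geometry alone; the features and decision rules are deliberately simplified assumptions, not a model from the lecture.

```python
# An illustrative sketch of the geometric-feature route to visual relation
# detection: predict a spatial preposition for a (subject, object) pair
# from their bounding boxes. Features and rules are simplified assumptions.
def geometry_features(s, o):
    """s, o: boxes as (x1, y1, x2, y2); returns normalised offsets/overlap."""
    sw, sh = s[2] - s[0], s[3] - s[1]
    dx = ((o[0] + o[2]) - (s[0] + s[2])) / (2 * sw)   # horizontal offset
    dy = ((o[1] + o[3]) - (s[1] + s[3])) / (2 * sh)   # vertical offset (y down)
    ix = max(0, min(s[2], o[2]) - max(s[0], o[0]))
    iy = max(0, min(s[3], o[3]) - max(s[1], o[1]))
    overlap = (ix * iy) / (sw * sh)                   # intersection / subject area
    return dx, dy, overlap

def predict_preposition(s, o):
    dx, dy, overlap = geometry_features(s, o)
    if overlap > 0.5:
        return "on" if dy > 0 else "under"   # subject above/below the object
    if abs(dx) > abs(dy):
        return "next to"
    return "above" if dy > 0 else "below"

person = (10, 10, 30, 60)   # (x1, y1, x2, y2), y grows downwards
bike = (32, 30, 70, 60)
print(predict_preposition(person, bike))  # "next to" for these boxes
```

Learned models replace these hand-written rules with classifiers over richer geometric, linguistic, and visual features, but the analysis of where they fail often traces back to exactly such spatial ambiguities.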



Multimodal Learning and Reasoning of Everyday Procedures

Prof. Carina Silberer
University of Stuttgart, GERMANY

In order for us to instruct and interact with machines in everyday life using natural language, they need to be able to understand and model procedural tasks. This capability is relevant to the fields of NLP, human-computer interaction (and therefore robotics), and multimodal machine learning in general. Despite its relevance, however, multimodal (visual-linguistic) modelling of procedures, i.e. the task of learning and understanding procedures from language and visual data, is still a challenge for current visual-linguistic models. This talk first introduces the challenges of automatic procedure learning, and then focuses on aspects and tasks that are essential for procedural learning of everyday tasks from visual-linguistic data:

The talk summarises existing work, its benefits and limitations, as well as approaches and applications to which procedural learning and the discussed aspects are relevant.
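For concreteness, one common way to represent such procedural data is as an ordered sequence of steps pairing an instruction with a keyframe, as in the hedged sketch below; the field names and the toy "next step" query are illustrative, not a specific dataset's schema.

```python
# A hedged sketch of a common representation for procedural (how-to) data:
# ordered steps pairing language with vision. Field names are illustrative.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Step:
    instruction: str   # linguistic side, e.g. one how-to sentence
    keyframe: str      # visual side, e.g. a video-frame identifier

@dataclass
class Procedure:
    task: str
    steps: List[Step]

def next_step(proc: Procedure, completed: int) -> Optional[Step]:
    """A trivial 'what should happen next?' query over the ordered steps."""
    return proc.steps[completed] if completed < len(proc.steps) else None

tea = Procedure("make tea", [
    Step("boil the water", "frame_0012.jpg"),
    Step("put a tea bag in the cup", "frame_0105.jpg"),
    Step("pour the water into the cup", "frame_0230.jpg"),
])
print(next_step(tea, completed=1).instruction)  # "put a tea bag in the cup"
```

Tasks such as step ordering, next-step prediction, and grounding instructions in keyframes all operate over structures of roughly this shape.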



Neurosymbolic Learning:
On Generalising Relational Visuospatial and Temporal Structure

Prof. Jakob Suchan
Constructor University, GERMANY

We present recent and emerging research aimed at developing a general framework for structured spatio-temporal learning from multimodal human behavioural stimuli. The framework and its underlying general, modular methods serve as a model for applying integrated (neural) visuo-auditory processing and (semantic) relational learning foundations, primarily in the behavioural sciences. Furthermore, the lecture will situate neurosymbolic learning within the broader context of cognitive vision and perception research aimed at developing general methods for commonsense reasoning with cognitively rooted structured characterisations of knowledge pertaining to space and motion.
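As a small illustration of the relational temporal structure referred to above, the following sketch classifies pairs of event intervals with a coarse subset of Allen's interval relations; the event names and timings are invented for illustration.

```python
# A minimal sketch of relational temporal structure: classifying pairs of
# event intervals with a coarse subset of Allen's interval relations.
# Event names and timings are invented for illustration.
def allen(a, b):
    """a, b: (start, end) intervals; returns a coarse Allen-style relation."""
    if a[1] < b[0]:
        return "before"
    if b[1] < a[0]:
        return "after"
    if a[0] == b[0] and a[1] == b[1]:
        return "equal"
    if b[0] <= a[0] and a[1] <= b[1]:
        return "during"
    if a[0] <= b[0] and b[1] <= a[1]:
        return "contains"
    return "overlaps"

# E.g. gaze and speech events from a multimodal behavioural recording.
events = {"gaze_at_listener": (0.0, 2.5), "speech_turn": (1.0, 4.0)}
print(allen(events["gaze_at_listener"], events["speech_turn"]))  # "overlaps"
```

Relations of this kind, extracted from (neural) detections over behavioural streams, are the symbolic material over which relational learning can then generalise.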