Germany (German Aerospace Center - DLR) \ Italy (University of Trento) \ Malta (University of Malta) \ Netherlands (Utrecht University) \ Portugal (University of Lisbon) \ Sweden (Örebro University, Chalmers University of Technology) \ Turkey (Hacettepe University, Koç University)

Visually Grounded Communication

Prof. Raffaella Bernardi
University of Trento, ITALY

In recent years we have seen remarkable progress on grounded conversational agents. The community has been challenged by interesting visual dialogue datasets based on referential guessing tasks, and the original baselines have been overcome by new models that achieve relatively high task success. In the first lecture, I will give an overview of such tasks and models. In the second lecture, I will bring into the discussion an important aspect of communication: through their interactions, interlocutors establish common ground, exchange information, learn from each other, and adapt to each other. I will present where the community stands with respect to this incremental, dynamic process.

Visuospatial Commonsense
- On Neurosymbolic Reasoning and Learning about Space and Motion

Prof. Mehul Bhatt
Örebro University, SWEDEN

This talk addresses computational cognitive vision and perception at the interface of (spatial) language, (spatial) logic, (spatial) cognition, and artificial intelligence. Summarising recent work, I present general methods for the semantic interpretation of dynamic visuospatial imagery, with an emphasis on the ability to (neurosymbolically) perform abstraction, reasoning, and learning with cognitively rooted structured characterisations of commonsense knowledge pertaining to space and motion.

The presented works, demonstrated against the backdrop of applications in autonomous driving, cognitive robotics, visuoauditory media, and cognitive psychology, are intended to serve as a systematic model and general methodology integrating diverse, multi-faceted AI methods pertaining to Knowledge Representation and Reasoning, Computer Vision, and Machine Learning towards realising practical, human-centred computational visual intelligence. I will conclude by highlighting a bottom-up interdisciplinary approach, at the confluence of Cognition, AI, Interaction, and Design Science, that is necessary to better appreciate the complexity and spectrum of varied human-centred challenges in the design and (usable) implementation of (explainable) artificial visual intelligence solutions in diverse human-system interaction contexts.

Multimodal Commonsense Reasoning

Prof. Aykut Erdem
Koç University, Istanbul, TURKEY

One of the long-standing goals of AI is to build computers that have a deep understanding of the world and can reason about its various aspects. Towards this end, deep learning-based techniques have made significant progress on many fronts in computer vision and natural language processing. More recently, there has been growing interest in moving beyond conventional text understanding or visual recognition tasks to explore the interaction between vision and language. Although we have also seen considerable advances in the development of such multimodal models, these systems largely lack commonsense intelligence. In this talk, I will first discuss research efforts made over the past several years on assessing the multimodal commonsense reasoning abilities of deep models. Then, I will summarize our recent and ongoing contributions on designing benchmark datasets for understanding commonsense procedural knowledge and intuitive reasoning about physical interactions of objects.

Multimodal Learning with Vision and Language

Prof. Erkut Erdem
Hacettepe University, TURKEY

Multimodal machine learning aims to develop methods capable of dealing with problems that require modelling and integrating multiple, mostly complementary, modalities. In the past few years, advances in deep learning have led to unified architectures that can efficiently process multimodal data, with many interesting applications in computer vision and natural language processing.
In this tutorial, we will provide a comprehensive review and analysis of methods for multimodal learning, focusing specifically on visual and textual data. We will discuss neural network architectures commonly used in computer vision and natural language processing, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers, along with deep generative models such as variational auto-encoders (VAEs) and generative adversarial networks (GANs), and review recent advances in multimodal pre-training. We will then cover some newly emerging tasks that combine vision and language, such as image captioning, visual question answering, visual reasoning, visual dialogue, image synthesis from text, and language-guided image manipulation.

Vision and Language Models and the Symbol Grounding Problem

Prof. Albert Gatt
Utrecht University, NETHERLANDS

This lecture will focus on the classic problem of symbol grounding for models that deal with natural language. Broadly speaking, this problem concerns how symbolic information (e.g. words or phrases) connects with the world of perception and action. We will approach this topic from two angles. First, we consider approaches to the grounding problem from the perspective of cognitive science, paying special attention to the importance of grounding for theories and models of meaning. Then, we look at ongoing work in the field of Vision and Language research. At present, this field is dominated by deep, transformer-based models which are pretrained on large datasets composed of pairings of images and descriptions. The goal of pretraining is to enable the models to establish cross-modal correspondences, so that linguistic (i.e. symbolic) data is “grounded” in visual features. After an overview of model architectures and pre-training approaches, we delve into recent work on analysing the grounding capabilities of such models using a variety of techniques, including foil-based methods and ablation. Models often turn out to have limited grounding capabilities, despite their good performance on downstream multimodal tasks such as Visual Question Answering.

Responsible AI and Responsible Machine Translation: challenges and initiatives

Prof. Helena Gorete Silva Moniz
University of Lisbon, PORTUGAL

Responsible AI, Green AI, AI for Social Good, Fair AI, and Ethically Aligned Design are all terms encompassed under a generic umbrella usually related to Ethics and AI. The topic is disruptive but still fairly broad, and substantial concrete action is needed: political institutions and industry will be heavily affected by legislation and market rules on Responsible (in the sense of accountable) AI, both now and in the near future. The impact of Responsible AI is not confined to technical, research, and industry applications; it also touches the everyday actions of every citizen and affects all areas of society. In this lecture, we will describe the core concepts and main pillars of Responsible AI, highlight the main initiatives created, and discuss, based on concrete examples, the ethical concerns around several AI applications. We will then look more closely at Machine Translation (MT) and other Natural Language Processing (NLP) areas, to tune our contextualised view of Responsible AI and apply it to the scope of multimodal, multilingual, and multicultural MT. The lecture will be an active learning experience as much as possible, discussing concepts and broad views on Responsible AI and MT through practical examples and use cases. As a sneak peek of the possible topics targeted, we will discuss, e.g., data origin, legal and copyright aspects; speech as a personally identifiable biomarker; vision and ethics; multimodal generation; end-user training on AI and awareness; metrics and reproducibility; model compression and Green AI; and societal implications. The lecture will also cover examples of research projects in the field and initiatives to create world-wide Centres for Responsible AI.

Detecting Relations In Images: Models, Applications and Challenges

Prof. Adrian Muscat
University of Malta, MALTA

Detecting relations between objects in images is a non-trivial problem in grounded language generation and understanding. This lecture, delivered from a computer science and engineering perspective, starts by defining and motivating the Visual Relation Detection sub-task that is required in Vision and Language tasks such as Visual Question Answering and Image Caption Generation, and their subsequent use in robotics. Various models that explicitly or implicitly detect or generate relations are described, followed by an analysis of where the models fail. The lecture concludes with an in-depth discussion of potential solutions.

Explainable AI meets Robotics - Robots that Learn and Reason from Experiences

Prof. Karinne Ramirez-Amaro
Chalmers University of Technology, SWEDEN

Advances in Collaborative Robots (Cobots) have accelerated with the development of novel data- and knowledge-driven methods. These methods allow robots, to some extent, to explain their decisions. This research area is known as Explainable AI and is gaining importance in the robotics community. One advantage of such methods is increased human trust in Cobots, since robots can explain their decisions, especially when errors occur or when facing new situations. Explainability is a challenging and important component when deploying Cobots in real, dynamic environments. In this talk, I will introduce a novel semantic-based learning method that generates compact and general models to infer human activities. I will also explain our current learning approaches for enabling Cobots to learn from experience. Reasoning and learning from experiences are key when developing general-purpose machine learning methods: these experiences allow robots to remember the best strategies for achieving a goal. The new generation of robots should therefore reason over past experiences while providing explanations in case of errors, improving both the autonomy of robots and humans' trust in working with them.

Neurosymbolic Learning:
On Generalising Relational Visuospatial and Temporal Structure

Dr. Jakob Suchan
German Aerospace Center (DLR), GERMANY

We present recent and emerging research aimed at developing a general framework for structured spatio-temporal learning from multimodal human behavioural stimuli. The framework and its underlying general, modular methods serve as a model for applying integrated (neural) visuo-auditory processing and (semantic) relational learning foundations to applications (primarily) in the behavioural sciences. Furthermore, the lecture will situate neurosymbolic learning within the broader context of cognitive vision and perception research aimed at developing general methods for commonsense reasoning with cognitively rooted structured characterisations of knowledge pertaining to space and motion.