INSTITUTE COURSES










LECTURES AND KEYNOTES



Towards a general account of multimodal meaning relations and its application to news media
Prof. John Bateman, University of Bremen, GERMANY

In this talk I introduce a general framework for analysing multimodal meaning making that incorporates a variety of theoretical constructs and methodological principles for engaging with complex media combining diverse forms of expression, such as text, moving images, static images, diagrams and so on. Particular attention will be paid to the close link drawn between the structure of the theory and practical decisions for annotating complex media for further study, both corpus-oriented and experimental. Illustrative examples will be drawn on throughout, and some recommendations for further lines of development will be proposed.




Representation Mediated Multimodality: A Confluence of AI and Cognition
Prof. Mehul Bhatt, Örebro University, SWEDEN

In this lecture, I will address three questions relevant to computationally "making sense" of (embodied) multimodal interaction: (1) What kind of relational abstraction mechanisms are needed to perform (explainable) grounded inference, e.g., question-answering, qualitative generalisation and hypothetical reasoning, relevant to embodied multimodal interaction? (2) How can such abstraction mechanisms be founded on behaviourally established cognitive human factors emanating from naturalistic empirical observation? And (3) how can behaviourally established abstraction mechanisms be articulated as formal declarative models suited for grounded (computational) knowledge representation and reasoning (KR) as part of large-scale hybrid AI or (computational) cognitive systems?

I will contextualise (1-3) against the backdrop of key results at the interface of (spatial) language, knowledge representation and reasoning, and visuo-auditory computing. The lectures will focus on summarising recent and ongoing research towards establishing a human-centric foundation and roadmap for the development of (neurosymbolically) grounded inference about embodied multimodal interaction. The intended functional purposes addressed encompass diverse operative needs such as explainable multimodal commonsense understanding, multimodal summarisation, multimodal interpretation-guided decision support, and analytical visualisation. Through the lectures and tutorial, the overall purpose is to highlight the significance of "Representation Mediated Multimodality" as a foundation for next-generation, human-centred AI applicable in diverse domains where multimodality or multimodal interaction are crucial, e.g., in Digital Visuo-Auditory Media (e.g., news, movies), Autonomous Vehicles, Social and Industrial Robots, User Experience and Interaction Design.
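For readers approaching this from the computational side, the minimal Python sketch below illustrates the general idea behind relational abstraction for grounded, explainable inference: numeric detections (here, bounding boxes) are lifted to symbolic facts that can be queried and explained. It is an illustration only, not the lecturer's actual framework or system; the names (Box, left_of, relational_facts) and the scene data are invented for the example.

```python
# A toy sketch of relational abstraction: lift raw detections to symbolic,
# explainable facts so that grounded questions can be answered declaratively.

from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned bounding box for a detected entity (x, y = top-left)."""
    name: str
    x: float
    y: float
    w: float
    h: float

def left_of(a: Box, b: Box) -> bool:
    """Qualitative abstraction: a lies entirely to the left of b."""
    return a.x + a.w < b.x

def overlaps(a: Box, b: Box) -> bool:
    """Qualitative abstraction: the two boxes share some area."""
    return not (a.x + a.w <= b.x or b.x + b.w <= a.x or
                a.y + a.h <= b.y or b.y + b.h <= a.y)

def relational_facts(boxes):
    """Turn numeric detections into a small set of symbolic facts."""
    facts = set()
    for a in boxes:
        for b in boxes:
            if a is not b:
                if left_of(a, b):
                    facts.add(("left_of", a.name, b.name))
                if overlaps(a, b):
                    facts.add(("overlaps", a.name, b.name))
    return facts

if __name__ == "__main__":
    scene = [Box("pedestrian", 10, 40, 20, 60), Box("car", 50, 45, 80, 40)]
    facts = relational_facts(scene)
    # Grounded question answering reduces to querying the symbolic facts,
    # and every answer can be explained by the facts that support it.
    print(("left_of", "pedestrian", "car") in facts)  # True
```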



SEMANTIC PARSING OF MULTIMODAL DATA
Dr. Johanna Björklund, Umeå University, SWEDEN

We give an accessible, high-level introduction to semantic parsing of multimodal data, that is, the translation of composite media items, such as a video with audio tracks and subtitles or a digital news article with text and images, into structured representations that capture central aspects of the combined media. The problem appears in many technological areas. In robotics, it takes the form of language grounding, where the linguistic constituents of a natural language command are linked to real-world objects, attributes, and relations. In media asset management, it appears as automatic captioning of images. It is also of inherent value in machine learning, because it allows us to transfer knowledge between different modalities: knowledge that we have learnt from text can, for example, be used to understand images. The focus is on so-called neuro-symbolic methods, which combine the power of deep-learning methods with the transparency and control of rule-based methods.
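As a concrete, if toy, illustration of what such a structured representation can look like, the Python sketch below links caption words to labelled image regions, a minimal form of language grounding. The Region class, the ground_caption function and the example data are invented for this illustration; real neuro-symbolic parsers rely on learned detectors and grammars rather than exact label matching.

```python
# A toy sketch of "semantic parsing" for a composite media item: link words in
# a caption to labelled image regions, producing a small set of grounded triples.

from dataclasses import dataclass

@dataclass
class Region:
    """An image region, e.g. as produced by an object detector."""
    label: str
    box: tuple   # (x, y, w, h)

def ground_caption(caption: str, regions: list) -> list:
    """Link caption words to detected regions, returning
    (word, 'grounded_in', bounding_box) triples."""
    triples = []
    for word in caption.lower().strip(".").split():
        match = next((r for r in regions if r.label == word), None)
        if match is not None:
            triples.append((word, "grounded_in", match.box))
    return triples

if __name__ == "__main__":
    regions = [Region("dog", (12, 30, 80, 60)), Region("ball", (100, 90, 20, 20))]
    print(ground_caption("A dog chases a ball.", regions))
    # [('dog', 'grounded_in', (12, 30, 80, 60)), ('ball', 'grounded_in', (100, 90, 20, 20))]
```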



COMPUTATIONAL MODELS FOR MULTIMODAL INFORMATION
Prof. Ralph Ewerth, Leibniz University, GERMANY

Multimodal information is omnipresent on the Web and comprises, for example, videos, online news, educational resources, scientific talks and papers. Besides Web search, there are other important scenarios and applications for multimodal data, e.g., human-computer interaction, digital humanities (graphic novels, audiovisual data), or learning environments. From a computational perspective, it is difficult to automatically interpret multimodal information and cross-modal semantic relations adequately. One reason is that the automatic understanding and interpretation of textual, visual or audio sources themselves is already difficult, and it is even more difficult to model and understand the interplay of two different modalities. While the visual/verbal divide has been investigated in the communication sciences for years, it has rarely been considered from an information retrieval perspective, which is what we do in this lecture. We present machine learning approaches to automatically recognize semantic cross-modal relations that are defined along several dimensions: cross-modal mutual information, semantic correlation, status, and abstractness. These dimensions are based on our own previous work and other taxonomies. The presented approaches utilize deep neural networks, for which typically a large amount of training data is needed. We describe two strategies to overcome this issue. Finally, we outline possible use cases in the fields of search as learning and news exploration.
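The sketch below shows, in heavily simplified form, one plausible way such a relation classifier can be set up: embeddings of the two modalities are concatenated and fed to a small neural network that predicts a relation class along one dimension. The architecture, dimensions and number of classes are assumptions for illustration, not the lecturer's actual models; in practice the embeddings would come from pretrained image and text encoders.

```python
# A minimal sketch (assumed architecture) of predicting a cross-modal relation,
# e.g. a "semantic correlation" class, from a pair of modality embeddings.

import torch
import torch.nn as nn

class CrossModalRelationClassifier(nn.Module):
    """Concatenate an image embedding and a text embedding and classify the
    image-text relation into a small set of (hypothetical) classes."""
    def __init__(self, img_dim=512, txt_dim=512, n_classes=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, img_emb, txt_emb):
        return self.mlp(torch.cat([img_emb, txt_emb], dim=-1))

if __name__ == "__main__":
    # Random tensors stand in for embeddings from pretrained encoders.
    model = CrossModalRelationClassifier()
    img, txt = torch.randn(4, 512), torch.randn(4, 512)
    logits = model(img, txt)          # shape: (4, 3)
    print(logits.argmax(dim=-1))      # predicted relation class per pair
```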



APPLYING THE RELEVANCE PRINCIPLE TO VISUAL AND MULTIMODAL COMMUNICATION
Prof. Charles Forceville, University of Amsterdam, NETHERLANDS

The humanities are in need of a general, all-encompassing model of communication. The contours of such a model actually already exist in Relevance Theory (RT), set out in Relevance: Communication and Cognition (Blackwell 1995 [1986]). In this monograph, anthropologist Dan Sperber and linguist Deirdre Wilson claim that Paul Grice’s four “maxims of conversation” (of quantity, quality, relation, and manner) can actually be reduced to a single one: the maxim of relevance. RT’s central claim is that each act of communication comes with the presumption of its own optimal relevance to the envisaged audience. Hitherto, however, RT scholars (virtually all of them linguists) have almost exclusively analysed face-to-face exchanges between two people who stand next to each other. The type of communication studied is thus predominantly verbal (perhaps supported by gestures and facial expressions). In order to fulfil RT’s potential to develop into an inclusive theory of communication, it is necessary to explore how it can be adapted and refined to account for (1) communication in modes other than (only) the spoken verbal mode, and (2) mass communication. In Visual and Multimodal Communication: Applying the Relevance Principle (Oxford UP 2020) I take a first step toward this goal by proposing how RT works for mass-communicative messages that involve static visuals. In my presentation I will first provide a crash course in “classic” RT for non-linguists, and then go on to show what the theory can contribute to visual and multimodal communication by discussing some examples. Importantly, RT is no less, but also no more, than a model, and has little to contribute to the analysis of specific instances of communication. Therefore, RT cannot replace other theories and approaches that provide analytical models for interpreting specific discourses, such as (social) semiotics, narratology, and stylistics. It only aims to provide an all-encompassing communication model within which the insights from other approaches can be put to optimal use.



Multimodality in action recognition as a basis for developing computational models in relation to human cognition
Prof. Paul Hemeren, University of Skövde, SWEDEN

It is no small secret that human vision is highly sensitive to the movement of other individuals. This sensitivity, however, is not restricted to local motion patterns as such. When we see the movements of others, we do not merely see the independent movement of hands, arms, feet and legs and movement of the torso. Instead, we are able to quickly and accurately identify many motion patterns as meaningful actions (running, jumping, throwing, crawling, etc.). Understanding human action recognition by using point-light displays of biological motion allows us to compare the accuracy of computational models against human cognitive and perceptual factors. This area can be used to demonstrate some of the modality factors in human action recognition as well as the possible relationship between modality factors and levels of action and event perception.

This lecture will present findings about different levels of action and event perception as well as direct comparisons between computational models and human cognition and perception using point-light displays of biological motion. A key question is then to evaluate the similarities and differences between human processing and computational models. To what extent should AI development using multimodality computation in human-machine interaction be concerned with the relation between processes and results? What role should this comparison (computational models and human cognition) have in understanding human cognition? We will present the extent to which a computational model based on kinematics can determine action similarity and how its performance relates to human similarity judgments of the same actions. The comparative experiment results show that both the model and human participants can reliably identify whether two actions are the same or not. Another of our studies shows that the affective motion of humans conveys messages that other humans perceive and understand without conventional linguistic processing. This ability to classify human movement into meaningful gestures or segments also plays a critical role in creating social interaction between humans and robots. We will also show the effect of levels of processing (top-down vs. bottom-up) on the perception of movement kinematics and primitives for grasping actions in order to gain insight into possible primitives used by the mirror system. We investigated whether segmentation was driven primarily by the kinematics of the action, as opposed to high-level top-down information about the action and the object used in the action.
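As an illustration of what a purely kinematic comparison can look like, the Python sketch below describes each point-light action by a velocity profile of its joints and scores the similarity of two profiles. It is a toy stand-in, not the model used in the studies; the trajectory format, the velocity_profile and action_similarity functions, and the synthetic data are all assumptions for the example.

```python
# An illustrative sketch of comparing two point-light actions from kinematics
# alone: summarise each action by its joint velocity profile, then compare.

import numpy as np

def velocity_profile(traj: np.ndarray) -> np.ndarray:
    """traj has shape (frames, joints, 2) for 2-D point-light data.
    Returns per-frame mean joint speed, a crude kinematic signature."""
    v = np.diff(traj, axis=0)                 # frame-to-frame displacement
    speed = np.linalg.norm(v, axis=-1)        # speed of each joint
    return speed.mean(axis=1)                 # average over joints

def action_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of the two velocity profiles (truncated to the
    shorter recording). 1.0 means identical profiles."""
    pa, pb = velocity_profile(a), velocity_profile(b)
    n = min(len(pa), len(pb))
    pa, pb = pa[:n], pb[:n]
    return float(pa @ pb / (np.linalg.norm(pa) * np.linalg.norm(pb) + 1e-9))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    walk_a = rng.normal(size=(60, 13, 2)).cumsum(axis=0)   # synthetic 13-marker clip
    walk_b = walk_a + rng.normal(scale=0.05, size=walk_a.shape)
    print(round(action_similarity(walk_a, walk_b), 3))     # close to 1.0
```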



MULTIMODAL APPROACHES TO ORAL CORPORA
Dr. Inés Olza, University of Navarra, SPAIN

The multimodal turn in interactional pragmatics and conversation analysis has opened new paths in understanding the full complexity of human linguistic behavior. This turn calls for integrating words, gesture, prosody and the many shapes of speaker-space and speaker-speaker physical interaction into the analysis of multilayered linguistic and communicative signals (Wagner, Malisz & Kopp 2014; Mondada 2019; Brown & Prieto 2021). Moreover, the dynamic relationship between the modalities involved in these complex communicative signals needs to be explained in light of high-scale cognitive operations and wider behavioral patterns such as alignment (Rasenberg, Özyürek & Dingemanse 2020), conceptual integration (Valenzuela et al. 2020) or memory processes (Schubotz et al. 2020), among others. In parallel, the speed of digital advances now allows us to collect, store and access unprecedented amounts of ecologically valid interactional data. Traditional oral corpora are giving way to multimodal corpora in which multilayered signals can be fully described in real situated contexts (Steen et al. 2018). Such data thus incorporate individual, intentional, social (intersubjective) and context-dependent variables that model how multimodality works in face-to-face interaction. My lectures will explore the interplay between multimodal analyses of interaction and corpus linguistics in several directions, aiming to reflect on how adding the multimodal layer(s) impacts how we build linguistic corpora, analyze and explain situated data, and support corpus-based theoretical conclusions. I will rely on four case studies, drawn from different languages and various interactional genres, to foster discussion on the necessary multimodal look at oral linguistic data.
Lecture 1: Corpus-based approaches to multimodal interaction. Case study 1: big multimodal data to study negative constructions. Case study 2: small multimodal data to unfold the secrets of simultaneous interpreting.
Lecture 2: Building and managing multimodal corpora. Case study 3: NewsScape, a big multimodal dataset to study human communication. Case study 4: the impact of multimodality on corpus segmentation.
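To make concrete what adding a multimodal layer to a corpus involves, the Python sketch below models time-aligned annotation tiers (in the spirit of tools such as ELAN) and queries for gestures that overlap a spoken negation, loosely echoing case study 1. The tier names, data and functions are invented for illustration and do not correspond to any of the corpora discussed.

```python
# A minimal sketch of time-aligned multimodal annotation tiers and a simple
# cross-tier query: gestures that co-occur with a spoken negation.

from dataclasses import dataclass

@dataclass
class Annotation:
    tier: str      # e.g. "speech", "gesture"
    start: float   # seconds
    end: float
    value: str

def overlapping(a: Annotation, b: Annotation) -> bool:
    """True when the two time-aligned annotations overlap in time."""
    return a.start < b.end and b.start < a.end

def gestures_with_negation(annotations):
    """Return (speech, gesture) pairs where a gesture co-occurs with a
    negative particle in the speech tier."""
    speech = [a for a in annotations if a.tier == "speech" and "not" in a.value]
    gesture = [a for a in annotations if a.tier == "gesture"]
    return [(s.value, g.value) for s in speech for g in gesture if overlapping(s, g)]

if __name__ == "__main__":
    corpus = [
        Annotation("speech", 0.0, 1.2, "I do not think so"),
        Annotation("gesture", 0.4, 1.0, "head shake"),
        Annotation("gesture", 2.0, 2.5, "palm-up open hand"),
    ]
    print(gestures_with_negation(corpus))   # [('I do not think so', 'head shake')]
```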



MULTIMODALITY AS DESIGN FEATURE OF HUMAN LANGUAGE CAPACITY
Prof. Asli Özyürek, Radboud University, NETHERLANDS

One of the unique aspects of human language is that in face-to-face communication it is universally multimodal (e.g., Holler and Levinson, 2019; Perniss, 2018). All hearing and deaf communities around the world use vocal and/or visual modalities (e.g., hands, body, face) with different affordances for semiotic and linguistic expression (e.g., Goldin-Meadow and Brentari, 2015; Vigliocco et al., 2014; Özyürek and Woll, 2019). Hearing communities use both vocal and visual modalities, combining speech and gesture. Deaf communities can use the visual modality for all aspects of linguistic expression in sign language. Visual articulators in both co-speech gesture and sign, unlike speech, have unique affordances for visible iconic, indexical (e.g., pointing) and simultaneous representations due to the use of multiple articulators. Such expressions have been considered in traditional linguistics as being external to the language system. I will, however, argue and show evidence for the fact that both spoken languages and sign languages combine such modality-specific expressions with arbitrary, categorical and sequential expressions in their language structures in cross-linguistically different ways (e.g., Kita and Özyürek, 2003; Slonimska, Özyürek and Capirci, 2020; Özyürek, 2018; 2021). Furthermore, they modulate language processing, interaction and dialogue (e.g., Rasenberg, Özyürek, and Dingemanse, 2020) and language acquisition (e.g., Furman, Küntay, Özyürek, 2014), suggesting that they are part of a design feature of a unified multimodal language system. I will end my talk with a discussion of how a multimodal (but not a unimodal) view can actually explain the dynamic, adaptive and flexible aspects of our language system, enabling it optimally to bridge the human biological, cognitive and learning constraints to the interactive, culturally varying communicative requirements of the face-to-face context.



COMMUNICATION FACE-TO-FACE AND ON THE PAGE
Prof. Barbara Tversky, Stanford University, UNITED STATES

Everyday face-to-face communication (remember that?) is inherently multimodal. It involves not just the words coming forth from our mouths but also the ways the voice is modulated; it involves the movements of our faces and bodies; it involves the world around us, including those we are interacting with; it involves the history of communication with our partners and with others like, and unlike, our partners. Communication entails not just ideas, but the ways the ideas are strung together (or not). I will discuss some of the many multimodal ways ideas are expressed and some of the ways they are strung together.



TUTORIALS

Visuo-auditory narratives, such as narrative film and online media, are now unquestionably a central medium for the negotiation of issues of social relevance in myriad spheres of discourse, including the general public, education, transcultural studies, and so on. Yet despite the medium's prominence, exactly how this works is still poorly understood: the gulf between the fine-grained technical details of visuo-auditory narratives and abstract configurations of social import is still considered by many researchers to be too great for effective research.
The tutorial programme expands upon several of the themes covered during the lectures and keynotes and addresses precisely this central research challenge by focusing on a highly constrained and yet crucial component of the visuo-auditory medium: the interpretation and synthesis of emotionally engaging visuo-auditory narrative media. It does so through a confluence of empirically and cognitively well-founded formalisations of narrative and its workings, combining well-specified and mutually complementary approaches: humanities-based analyses of narrative patterns, their recipients and contexts of reception, and fine-grained computational cognitive modelling of visuo-auditory narrative interpretation from the viewpoint of embodied multimodal interaction and formal narrative semantics.
The tutorial sessions will demonstrate state-of-the-art techniques, as well as discuss opportunities for future research.


PANEL DISCUSSION

The panel discussion, also open to the public, addresses the status quo and emerging, outward-looking questions relevant to Multimodality Research, its applications, and its significance for the design and engineering of next-generation technologies for the synthesis, dissemination, and analysis of (multimodal) media content. The panel discussion will also reflect upon the opportunities and threats posed by emerging technologies in AI and Machine Learning, particularly in the media context.


DOCTORAL COLLOQUIUM

The Doctoral Colloquium (DC) at the Institute on Multimodality 2022 is an opportunity primarily (but not exclusively) for early-stage doctoral researchers to present ongoing or planned research in one or more of the key themes in scope of the institute. Participating DC members engage with institute faculty and participants throughout the institute via planned lectures, as well as in dedicated poster sessions and a final presentation event devoted solely to DC members. Participant details are available via the Institute 2022 Brochure.


PERSPECTIVES WORKSHOP / MULTIMODALITY AND MEDIA STUDIES

The final event of the Institute on Multimodality 2022 is a perspectives workshop aiming to articulate immediate follow-up actions to further document, as well as advance, research at the interface of "Multimodality, Cognition, Society" in the specific context of media and communications studies. Invited participants encompass the fields of media and communications, linguistics, cognitive science (perception), and computer science (AI and ML). Other details are available via the Institute 2022 Brochure.