Interaction Stimuli: Design and Previews

The novel human interaction stimuli consist of 27 event scenarios structured as film-like narratives, organized across nine core story contexts (S1–S9).

Each scenario embeds diverse interaction-based visuo-auditory cues (such as gaze shifts and pointing gestures), and each story context is enacted in three variations (v1, v2, v3; nine contexts × three variations = 27 scenarios) that introduce controlled differences in interaction patterns while maintaining the overarching story structure.

You can preview a 30-second clip of version v1 for each story context (S1–S9): select a scene below to play, and click the video to toggle sound.


General Visuo-Auditory Cues Examples

We classified five fundamental visuo-auditory cues: Speaking, Gaze, Motion, Hand Action, and Exit/Entry. These prevalent interaction features drive joint attention, turn-taking, and social coordination in natural human interactions. Below are some example cueing instances (hover to play, click for sound):

Speaking | Gaze | Turning | Moving | Hand Action | Exit

Speaking | Gaze | Reaching Hand Action | Moving | Entry

Speaking (turn-taking) | Gaze | Turning | Moving | Entry

Speaking | Gaze (transition)

Speaking (turn-taking) | Gaze | Hand Action | Moving | Entry

Speaking (turn-taking) | Gaze | Hand Action | Moving

Speaking (turn-taking) | Gaze | Turning | Moving | Entry

Speaking (turn-taking) | Gaze | Hand Action | Moving | Entry

Speaking (turn-taking) | Gaze | Hand Action | Moving | Entry

Context-locked Visuo-Auditory Cues Examples

These are identical narrative segments with controlled visuo-auditory variations (v1/v2/v3) that preserve story context and actor intent, enabling direct comparison of viewer attention metrics across the manipulated interactions. Some example cueing instances follow (click media to toggle sound); a sketch of such a comparison appears after the scene list.

Spatial Position

Scene S1-v1
Scene S1-v2
Scene S1-v3

Gaze Transition

Scene S4-v1
Scene S4-v2
Scene S4-v3

Joint Attention Initiation

Scene S9-v1
Scene S9-v2
Scene S9-v3
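
To make that comparison concrete, below is a minimal sketch of comparing dwell time per AOI across the three variations of one story context. The file names and CSV columns (aoi_label, duration_ms) are hypothetical stand-ins, since the dataset's actual fixation export format may differ.

```python
# Minimal sketch: compare dwell time per AOI across the three variations
# (v1/v2/v3) of one story context. File names and CSV columns are
# hypothetical placeholders for the dataset's actual fixation exports.
import csv
from collections import defaultdict

def dwell_time_by_aoi(csv_path):
    """Sum fixation durations (ms) per AOI label from a fixation export."""
    totals = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["aoi_label"]] += float(row["duration_ms"])
    return totals

# Same story context (S1), three controlled variations.
for variant in ("v1", "v2", "v3"):
    totals = dwell_time_by_aoi(f"fixations_S1-{variant}.csv")  # hypothetical paths
    print(variant, dict(totals))
```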

Associated Data Components

These novel interaction stimuli are annotated with detailed semantic labels for scene components and structures, describing what occurs within each event and how visual attention is directed toward scene elements, based on eye-tracking data collected from 90 participants.

Scene Elements

[Figure: Recording setup]

Visuo-auditory features of the scenes are categorized by form (objects/regions) and state (dynamic/static). Annotations focus on dynamic objects (persons/robots) with body-part labels (e.g., face, head) tracked via bounding boxes across all scenes. The bounding-box data includes 32 labeled entities spanning 11,801 seconds.
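
For illustration, here is a minimal sketch of how these per-frame bounding-box annotations could be represented and queried; the entity names, body-part labels, and coordinate conventions are assumptions, not the dataset's actual schema.

```python
# Minimal sketch of a per-frame bounding-box record and a point-in-box
# query. Field names and example values are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class BoundingBox:
    entity: str   # e.g. "person_1" or "robot" (hypothetical labels)
    part: str     # e.g. "face", "head"
    frame: int    # video frame index
    x: float      # top-left corner, pixels
    y: float
    w: float      # box width, pixels
    h: float      # box height, pixels

    def contains(self, px: float, py: float) -> bool:
        """True if a point (e.g. a gaze sample) falls inside this box."""
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h

# Toy query: which annotated entities does a point hit in frame 120?
boxes = [
    BoundingBox("person_1", "face", 120, 410.0, 95.0, 60.0, 70.0),
    BoundingBox("robot", "head", 120, 780.0, 130.0, 55.0, 55.0),
]
print([f"{b.entity}/{b.part}" for b in boxes if b.contains(432.5, 118.0)])
# -> ['person_1/face']
```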

Scene Structure

[Figure: Eye-tracking visualization]

Scene interactions are classified by observable modalities that capture what an observer can perceive. These visuo-auditory features detail context-agnostic entity actions through ELAN annotations, including: speaking, gaze, body pose, hand action, head movement, motion (displacement), and factual aspects (spatial position, visibility, presence).
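
Since ELAN's .eaf files are plain XML, tiers like these can be read with standard tooling. A minimal sketch follows, assuming a time-aligned tier named "speaking"; the dataset's actual TIER_IDs may differ.

```python
# Minimal sketch: read one annotation tier from an ELAN .eaf file (plain XML).
# The tier name "speaking" is an assumption about the dataset's tier IDs.
import xml.etree.ElementTree as ET

def read_tier(eaf_path, tier_id):
    """Return (start_ms, end_ms, value) triples for one time-aligned tier."""
    root = ET.parse(eaf_path).getroot()
    # Map symbolic time-slot ids to milliseconds (assumes aligned slots).
    slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iter("TIME_SLOT") if ts.get("TIME_VALUE")}
    intervals = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            intervals.append((slots[ann.get("TIME_SLOT_REF1")],
                              slots[ann.get("TIME_SLOT_REF2")],
                              ann.findtext("ANNOTATION_VALUE", default="")))
    return intervals

# e.g. read_tier("S1-v1.eaf", "speaking")  # hypothetical file and tier names
```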

Visual Attention

[Figure: Annotation interface]

Visual attention is annotated in ELAN using bounding-box AOIs over the Scene Elements, identifying fixated targets at ~6 ms intervals. The annotations cover low-level saccades and fixations as well as high-level, object-level attention toward persons/robots and their body parts, semi-automatically derived from the collected eye-tracking data (n = 90).
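
As a rough illustration of that semi-automatic step, the sketch below assigns each gaze sample to whichever annotated bounding box it lands in. The sample format, box layout, and 25 fps frame rate are assumptions made for the example, not the dataset's documented pipeline.

```python
# Minimal sketch: map ~6 ms gaze samples onto AOI bounding boxes to derive
# object-level attention labels. Formats and frame rate are assumptions.
def label_gaze_samples(samples, boxes_by_frame, fps=25):
    """samples: (t_ms, x, y) tuples; boxes_by_frame: frame -> [(label, x, y, w, h)]."""
    labelled = []
    for t_ms, gx, gy in samples:
        frame = int(t_ms / 1000 * fps)  # gaze timestamp -> video frame
        hit = None
        for label, x, y, w, h in boxes_by_frame.get(frame, ()):
            if x <= gx <= x + w and y <= gy <= y + h:
                hit = label             # e.g. "person_1/face"
                break
        labelled.append((t_ms, hit))    # None = gaze outside every AOI
    return labelled

# Toy example: two samples 6 ms apart, one AOI box annotated in frame 0.
boxes = {0: [("person_1/face", 400.0, 90.0, 60.0, 70.0)]}
print(label_gaze_samples([(0.0, 430.0, 120.0), (6.0, 10.0, 10.0)], boxes))
# -> [(0.0, 'person_1/face'), (6.0, None)]
```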

Get in touch

The dataset is available via: [ www ]
Send us an email if you would like access to the private link, wish to discuss any aspect of the work, or would like to explore collaborations:
Vipul Nair - nair@codesign-lab.org