Interaction Stimuli: Design and Previews

The novel human interaction stimuli consist of 27 event scenarios structured as film-like narratives, organized across nine core story contexts (S1–S9).

Each scenario embeds diverse interaction-based visuo-auditory cues (such as gaze shifts and pointing gestures), and each story context is enacted in three variations (v1, v2, v3; nine contexts × three variations = 27 scenarios) that introduce controlled differences in interaction patterns while maintaining the overarching story structure.

You can preview a 30-second clip of version v1 for each story context (S1–S9): select a scene below to play, and click the video to toggle sound.


General Visuo-Auditory Cues Examples

We classified five fundamental visuo-auditory cues: Speaking, Gaze, Motion, Hand Action, and Exit/Entry. These prevalent interaction features drive joint attention, turn-taking, and social coordination in natural human interactions. Below are some example cueing instances (hover to play, click for sound):

Speaking | Gaze | Turning | Moving | Hand Action | Exit

Speaking | Gaze | Reaching Hand Action | Moving | Entry

Speaking (turn-taking) | Gaze | Turning | Moving | Entry

Speaking | Gaze (transition)

Speaking (turn-taking) | Gaze | Hand Action | Moving | Entry

Speaking (turn-taking) | Gaze | Hand Action | Moving

Speaking (turn-taking) | Gaze | Turning | Moving | Entry

Speaking (turn-taking) | Gaze | Hand Action | Moving | Entry

Speaking (turn-taking) | Gaze | Hand Action | Moving | Entry

Context-locked Visuo-Auditory Cues Examples

These are identical narrative segments with controlled visuo-auditory variations (v1/v2/v3) that preserve story context and actor intent, enabling direct comparison of viewer attention metrics across the manipulated interactions. Some example cueing instances follow (click media to toggle sound); a sketch of such a comparison appears after the scene list.

Spatial Position

Scene S1-v1
Scene S1-v2
Scene S1-v3

Gaze Transition

Scene S4-v1
Scene S4-v2
Scene S4-v3

Joint Attention Initiation

Scene S9-v1
Scene S9-v2
Scene S9-v3
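
To make that comparison concrete, below is a minimal sketch of comparing dwell time per AOI across the three variations of one story context. The file names and CSV columns (aoi_label, duration_ms) are hypothetical stand-ins, since the dataset's actual fixation export format may differ.

```python
# Minimal sketch: compare dwell time per AOI across the three variations
# (v1/v2/v3) of one story context. File names and CSV columns are
# hypothetical placeholders for the dataset's actual fixation exports.
import csv
from collections import defaultdict

def dwell_time_by_aoi(csv_path):
    """Sum fixation durations (ms) per AOI label from a fixation export."""
    totals = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["aoi_label"]] += float(row["duration_ms"])
    return totals

# Same story context (S1), three controlled variations.
for variant in ("v1", "v2", "v3"):
    totals = dwell_time_by_aoi(f"fixations_S1-{variant}.csv")  # hypothetical paths
    print(variant, dict(totals))
```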

Associated Data Components

These novel interaction stimuli are annotated with detailed semantic labels for scene components and structures, describing what occurs within each event and how visual attention is directed toward scene elements, based on eye-tracking data collected from 90 participants.

Scene Elements

[Figure: Recording setup]

Visuo-auditory features of the scenes are categorized by form (objects/regions) and state (dynamic/static). Annotations focus on dynamic objects (persons/robots) with body-part labels (e.g., face, head) tracked via bounding boxes across all scenes. The bounding-box data includes 32 labeled entities spanning 11,801 seconds.
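
For illustration, here is a minimal sketch of how these per-frame bounding-box annotations could be represented and queried; the entity names, body-part labels, and coordinate conventions are assumptions, not the dataset's actual schema.

```python
# Minimal sketch of a per-frame bounding-box record and a point-in-box
# query. Field names and example values are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class BoundingBox:
    entity: str   # e.g. "person_1" or "robot" (hypothetical labels)
    part: str     # e.g. "face", "head"
    frame: int    # video frame index
    x: float      # top-left corner, pixels
    y: float
    w: float      # box width, pixels
    h: float      # box height, pixels

    def contains(self, px: float, py: float) -> bool:
        """True if a point (e.g. a gaze sample) falls inside this box."""
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h

# Toy query: which annotated entities does a point hit in frame 120?
boxes = [
    BoundingBox("person_1", "face", 120, 410.0, 95.0, 60.0, 70.0),
    BoundingBox("robot", "head", 120, 780.0, 130.0, 55.0, 55.0),
]
print([f"{b.entity}/{b.part}" for b in boxes if b.contains(432.5, 118.0)])
# -> ['person_1/face']
```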

Scene Structure

[Figure: Eye-tracking visualization]

Scene interactions are classified by observable modalities that capture what an observer can perceive. These visuo-auditory features detail context-agnostic entity actions through ELAN annotations, including: speaking, gaze, body pose, hand action, head movement, motion (displacement), and factual aspects (spatial position, visibility, presence).
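
Since ELAN's .eaf files are plain XML, tiers like these can be read with standard tooling. A minimal sketch follows, assuming a time-aligned tier named "speaking"; the dataset's actual TIER_IDs may differ.

```python
# Minimal sketch: read one annotation tier from an ELAN .eaf file (plain XML).
# The tier name "speaking" is an assumption about the dataset's tier IDs.
import xml.etree.ElementTree as ET

def read_tier(eaf_path, tier_id):
    """Return (start_ms, end_ms, value) triples for one time-aligned tier."""
    root = ET.parse(eaf_path).getroot()
    # Map symbolic time-slot ids to milliseconds (assumes aligned slots).
    slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iter("TIME_SLOT") if ts.get("TIME_VALUE")}
    intervals = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            intervals.append((slots[ann.get("TIME_SLOT_REF1")],
                              slots[ann.get("TIME_SLOT_REF2")],
                              ann.findtext("ANNOTATION_VALUE", default="")))
    return intervals

# e.g. read_tier("S1-v1.eaf", "speaking")  # hypothetical file and tier names
```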

Visual Attention

[Figure: Annotation interface]

Visual attention is annotated in ELAN using bounding-box AOIs over the Scene Elements, identifying fixated targets at ~6 ms intervals. The annotations cover low-level saccades and fixations as well as high-level, object-level attention toward persons/robots and their body parts, semi-automatically derived from the collected eye-tracking data (n = 90).
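
As a rough illustration of that semi-automatic step, the sketch below assigns each gaze sample to whichever annotated bounding box it lands in. The sample format, box layout, and 25 fps frame rate are assumptions made for the example, not the dataset's documented pipeline.

```python
# Minimal sketch: map ~6 ms gaze samples onto AOI bounding boxes to derive
# object-level attention labels. Formats and frame rate are assumptions.
def label_gaze_samples(samples, boxes_by_frame, fps=25):
    """samples: (t_ms, x, y) tuples; boxes_by_frame: frame -> [(label, x, y, w, h)]."""
    labelled = []
    for t_ms, gx, gy in samples:
        frame = int(t_ms / 1000 * fps)  # gaze timestamp -> video frame
        hit = None
        for label, x, y, w, h in boxes_by_frame.get(frame, ()):
            if x <= gx <= x + w and y <= gy <= y + h:
                hit = label             # e.g. "person_1/face"
                break
        labelled.append((t_ms, hit))    # None = gaze outside every AOI
    return labelled

# Toy example: two samples 6 ms apart, one AOI box annotated in frame 0.
boxes = {0: [("person_1/face", 400.0, 90.0, 60.0, 70.0)]}
print(label_gaze_samples([(0.0, 430.0, 120.0), (6.0, 10.0, 10.0)], boxes))
# -> [(0.0, 'person_1/face'), (6.0, None)]
```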

Get in touch

The dataset is available via: [ www ]
Send us an email if you would like access to the private link, wish to discuss any aspect of the work, or would like to explore collaborations:
Vipul Nair - nair@codesign-lab.org