Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback

Mohd Hozaifa Khan, Ravi Kiran Sarvadevabhatla

IIIT Hyderabad | CVIT

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

A new multimodal dataset and foundational agents for exploring goal-driven asynchronous collaboration through Pictionary.

Sketchtopia Teaser Visual 1 - Example Sketch Interaction

Sketchtopia Teaser Visual 2 - Diverse Sketches from Dataset

Sketchtopia Teaser Visual 3 - Asynchronous Communication Example

Abstract

We introduce Sketchtopia, a large-scale dataset and AI framework designed to explore goal-driven, multimodal communication through asynchronous interactions in a Pictionary-inspired setup. Sketchtopia captures natural human interactions, including freehand sketches, open-ended guesses, and iconic feedback gestures, showcasing the complex dynamics of cooperative communication under constraints. It features over 20K gameplay sessions from 916 players, comprising 263K sketches, 10K erases, 56K guesses, and 19.4K iconic feedback signals.

We also present multimodal foundational agents with capabilities for generative sketching, guess generation, and asynchronous communication. Our dataset includes 800 human-agent sessions for benchmarking the agents. We further propose novel metrics to characterize collaborative success, responsiveness to feedback, and inter-agent asynchronous communication. Sketchtopia pushes the boundaries of multimodal AI, establishing a new benchmark for studying asynchronous, goal-oriented interactions between humans and AI agents.

Key Contributions

Rich Dataset

Large-scale, multimodal data capturing real-world asynchronous sketching dynamics with iconic feedback.

Foundational Agents

DRAWBOT & GUESSBOT designed for asynchronous interaction, generative sketching, and feedback responsiveness.

New Metrics

Novel metrics (AAO, FRS, MATS) tailored to evaluate asynchronous multimodal collaboration effectiveness and naturalness.

Dataset Highlights: Multimodal & Asynchronous

20K+

Sessions

Rich collection capturing diverse human Pictionary gameplay.

263K+

Sketches

Massive corpus of iterative freehand drawings for visual communication.

56K+

Open-ended Guesses

Natural language guesses reflecting understanding of visual cues.

19K+

Iconic Feedback

Non-verbal cues (👍👎❓) guiding the collaborative process asynchronously.

916

Players

Data from a diverse participant group ensuring robust analysis.

800

Human-Agent Sessions

Valuable data from humans interacting with our agents.

Sketchtopia Agents

ACTIONDECIDER: The Asynchronous Controller

The ActionDecider is the core component that enables asynchronous communication. It acts as a lightweight controller, continuously monitoring the game state (sketches, guesses, feedback) and deciding when agents should act and what action they should take. This allows for fluid, human-like interaction without the constraints of turn-taking, mirroring real-world communication dynamics.

ActionDecider: The Brains Behind Asynchronous Interaction

Multimodality Sketchtopia Agent Diagram
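The controller's role described above can be illustrated with a minimal sketch. This is a rule-based stand-in, not the ActionDecider from the paper: the `GameState` fields, the `RuleBasedDecider` class, and its decision rules are all hypothetical, chosen only to show how an agent can pick between acting and staying idle without fixed turns.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    IDLE = "idle"
    DRAW = "draw"
    GUESS = "guess"


@dataclass
class GameState:
    """Hypothetical snapshot of the shared game state."""
    strokes: int = 0                               # strokes on the canvas so far
    guesses: list = field(default_factory=list)    # guesses made so far
    feedback: list = field(default_factory=list)   # e.g. "up", "down", "confused"


@dataclass
class RuleBasedDecider:
    """Toy controller (assumption): act only when the observed state has changed."""
    role: str                                      # "drawer" or "guesser"
    seen_strokes: int = 0
    seen_guesses: int = 0
    seen_feedback: int = 0

    def decide(self, state: GameState) -> Action:
        # Work out what is new since the last time this agent looked.
        new_strokes = state.strokes - self.seen_strokes
        new_guesses = len(state.guesses) - self.seen_guesses
        new_feedback = len(state.feedback) - self.seen_feedback
        self.seen_strokes = state.strokes
        self.seen_guesses = len(state.guesses)
        self.seen_feedback = len(state.feedback)

        if self.role == "guesser":
            # Guess only when the canvas or the feedback changed.
            return Action.GUESS if (new_strokes or new_feedback) else Action.IDLE
        # Drawer: start on an empty canvas, then draw again only after new guesses.
        if state.strokes == 0 or new_guesses:
            return Action.DRAW
        return Action.IDLE
```

Each agent would call `decide` on every tick of its own loop, so the drawer and guesser can act at the same moment or both stay idle. A learned controller would replace the hand-written rules while keeping the same interface.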

DRAWBOT: The Sketcher

DRAWBOT visually communicates the target word through asynchronous sketching, leveraging state-of-the-art generative models fine-tuned for iterative refinement based on communication context.

Generates sketches from target concepts, adapting to canvas state.
Iteratively refines drawings based on guesses and feedback signals.
Capable of adapting to iconic feedback (👍, 👎, ❓).
Operates asynchronously, deciding when to draw or stay idle.

Hierarchical Multimodal Agent Architecture Diagram.

DRAWBOT Architecture Diagram.

GUESSBOT: The Guesser

GUESSBOT interprets evolving sketches and communication cues to make intelligent guesses, using a retrieval-based framework informed by historical interaction data.

Incorporates the current sketch canvas content using vision models.
Generates relevant textual guesses using efficient retrieval and filtering.
Acts asynchronously, deciding when new information warrants a guess.
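The retrieve-then-filter idea in the list above can be sketched as follows. This is an illustration under stated assumptions, not GUESSBOT's implementation: the embeddings are plain lists of floats standing in for the output of some vision encoder, and `cosine` and `retrieve_guesses` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_guesses(query, bank, labels, already_guessed, k=3):
    """Return up to k candidate words, most similar past sketch first.

    query: embedding of the current canvas (assumed to come from a vision model).
    bank: embeddings of past sketches; labels: the word paired with each one.
    already_guessed: words tried earlier in this session.
    """
    scores = [cosine(query, emb) for emb in bank]
    ranked = sorted(range(len(bank)), key=lambda i: -scores[i])
    candidates = []
    for i in ranked:
        word = labels[i]
        # Filtering step: skip repeats and words already tried this session.
        if word in already_guessed or word in candidates:
            continue
        candidates.append(word)
        if len(candidates) == k:
            break
    return candidates
```

For example, with a bank holding sketches labelled "cat", "dog", and "sun" and a canvas closest to the "cat" sketch, a session that has already tried "cat" would get "dog" as its next candidate.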

GUESSBOT Simplified Architecture Diagram

Evaluating Agent Performance

AAO

Asynchronous Action Overlap

Measures concurrent actions between agents. AAO values close to those of human players suggest more natural, human-like interaction dynamics.
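One simple way to quantify concurrent action is the share of session time during which both players are active at once. The sketch below uses that definition as an assumption; it is not the AAO formula from the paper. The function name `overlap_fraction` and the interval representation are hypothetical.

```python
def overlap_fraction(drawer_spans, guesser_spans, session_length):
    """Share of the session during which both players act at the same time.

    Each span is a (start, end) pair in seconds. Spans within one list are
    assumed not to overlap each other.
    """
    overlap = 0.0
    for d_start, d_end in drawer_spans:
        for g_start, g_end in guesser_spans:
            # Length of the intersection of the two intervals, or 0 if disjoint.
            overlap += max(0.0, min(d_end, g_end) - max(d_start, g_start))
    return overlap / session_length
```

A strictly turn-based exchange would score 0 under this definition, so comparing the value for agent sessions against human sessions gives a rough sense of how human-like the overlap is.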

FRS

Feedback Responsiveness Score

Quantifies how effectively agents adapt to feedback (👍👎) and move toward the goal.
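As a rough illustration of what "moving toward the goal after feedback" could mean numerically, the sketch below averages the change in a closeness-to-target score across feedback events. Both the closeness score (assumed to lie in [0, 1]) and the `feedback_gain` function are hypothetical; the paper's FRS definition is not reproduced here.

```python
def feedback_gain(events):
    """Mean change in closeness to the target across feedback events.

    events: list of (closeness_before, closeness_after) pairs, one per
    feedback signal. A positive result means the agent tended to move
    toward the target after receiving feedback.
    """
    if not events:
        return 0.0
    return sum(after - before for before, after in events) / len(events)
```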

MATS

Multimodal Action Timing Similarity

Compares agent action timing patterns with human interactions to assess the naturalness of pacing.
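One way to compare pacing is to histogram the gaps between consecutive actions and measure how much the agent and human histograms overlap. The sketch below takes that approach as an assumption, not as the MATS formula from the paper. The bin settings and the function names `gap_histogram` and `timing_similarity` are hypothetical.

```python
def gap_histogram(timestamps, bin_width, n_bins):
    """Normalized histogram of gaps between consecutive action timestamps."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    hist = [0.0] * n_bins
    for gap in gaps:
        # Gaps beyond the last bin edge are counted in the last bin.
        hist[min(int(gap // bin_width), n_bins - 1)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def timing_similarity(agent_ts, human_ts, bin_width=1.0, n_bins=5):
    """Histogram intersection in [0, 1]; 1 means identical gap distributions."""
    agent = gap_histogram(agent_ts, bin_width, n_bins)
    human = gap_histogram(human_ts, bin_width, n_bins)
    return sum(min(a, h) for a, h in zip(agent, human))
```

An agent that acts every second compared with a human who acts every four seconds would score 0 here, while two players with the same distribution of gaps would score 1.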

Example Sessions

Target: ANGRY

Type: Human-Human

Key Guess: "Angry"

Feedback Given: 👍

Successful communication: Guesser guessed the correct emotion from the sketch and feedback.

Target: ANGRY

Type: Human-Human

Failed Guesses: afraid, mute, etc.

Feedback Given: ❓

Failed communication: Guesser failed to guess the correct emotion despite the sketch and feedback.

Target: DUSTBIN

Type: Human-Human

Key Guess: "Dustbin"

Feedback Given: No feedback

Successful communication: Guesser guessed the correct target word from the sketch, with no feedback given.

Target: DUSTBIN

Type: Human-Human

Failed Guesses: face, etc.

Feedback Given: 👎

Failed communication: Guesser failed to guess the correct target word despite the sketch and feedback.

Target: WALK

Type: Human-Agent

Key Guess: "Walk"

Feedback Given: No feedback

Successful communication: Guesser guessed the correct target word from the sketch, with no feedback given.

Target: WALK

Type: Human-Agent

Failed Guesses: man, run, etc.

Feedback Given: 👎, 👍

Failed communication: Guesser failed to guess the correct target word despite the sketch and feedback.

Authors

Mohd Hozaifa Khan

IIIT Hyderabad

Ravi Kiran Sarvadevabhatla

IIIT Hyderabad

Resources

Interactive Demo - Coming Soon!

Stay tuned for a live demo where you can experience Sketchtopia agents interacting.

In the meantime, explore the Dataset

Citation

@inproceedings{khan2025sketchtopia,
  author    = {Mohd Hozaifa Khan and Ravi Kiran Sarvadevabhatla},
  title     = {Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
  url       = {https://sketchtopia25.github.io/}
}