Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback

A new multimodal dataset and foundational agents for exploring goal-driven asynchronous collaboration through Pictionary.

Abstract

We introduce Sketchtopia, a large-scale dataset and AI framework designed to explore goal-driven, multimodal communication through asynchronous interactions in a Pictionary-inspired setup. Sketchtopia captures natural human interactions, including freehand sketches, open-ended guesses, and iconic feedback gestures, showcasing the complex dynamics of cooperative communication under constraints. It features over 20K gameplay sessions from 916 players, comprising 263K sketches, 10K erases, 56K guesses, and 19.4K iconic feedback signals.

We also introduce multimodal foundational agents capable of generative sketching, guess generation, and asynchronous communication. The dataset includes 800 human-agent sessions for benchmarking these agents. We further propose novel metrics to characterize collaborative success, responsiveness to feedback, and inter-agent asynchronous communication. Sketchtopia establishes a new benchmark for studying asynchronous, goal-oriented interactions between humans and AI agents.

Key Contributions

Rich Dataset

Large-scale, multimodal data capturing real-world asynchronous sketching dynamics with iconic feedback.

Foundational Agents

DRAWBOT & GUESSBOT designed for asynchronous interaction, generative sketching, and feedback responsiveness.

New Metrics

Novel metrics (AAO, FRS, MATS) tailored to evaluate asynchronous multimodal collaboration effectiveness and naturalness.

Dataset Highlights: Multimodal & Asynchronous

20K+

Sessions
Rich collection capturing diverse human Pictionary gameplay.

263K+

Sketches
Massive corpus of iterative freehand drawings for visual communication.

56K+

Open-ended Guesses
Natural language guesses reflecting understanding of visual cues.

19K+

Iconic Feedback
Non-verbal cues (👍👎❓) guiding the collaborative process asynchronously.

916

Players
Data from a diverse participant group ensuring robust analysis.

800

Human-Agent Sessions
Valuable data from humans interacting with our agents.

Sketchtopia Agents

ACTIONDECIDER: The Asynchronous Controller

ACTIONDECIDER is the core component that enables asynchronous communication. It acts as a lightweight controller, continuously monitoring the game state (sketches, guesses, feedback) and deciding when agents should act and which action they should take. This allows for fluid, human-like interaction without the constraints of turn-taking, mirroring real-world communication dynamics.
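
To make the controller idea concrete, here is a minimal rule-based sketch of a decision step that chooses between acting and staying idle from the current game state. It is only an illustration: the `GameState` fields, the `decide` function, the feedback labels, and the `cooldown` threshold are all hypothetical and are not taken from the Sketchtopia implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    IDLE = "idle"
    DRAW = "draw"
    GUESS = "guess"


@dataclass
class GameState:
    # All fields are illustrative assumptions, not the dataset schema.
    now: float                         # current time in seconds
    last_stroke_time: Optional[float]  # time of the most recent stroke, if any
    last_guess_time: Optional[float]   # time of the most recent guess, if any
    last_feedback: Optional[str]       # "up", "down", "question", or None


def decide(role: str, state: GameState, cooldown: float = 2.0) -> Action:
    """Pick an action for `role` ("drawer" or "guesser") without turn-taking."""
    if role == "drawer":
        if state.last_stroke_time is None:
            return Action.DRAW  # empty canvas: start sketching
        if state.last_feedback in ("down", "question"):
            return Action.DRAW  # negative or confused feedback: refine now
        idle_for = state.now - state.last_stroke_time
        return Action.DRAW if idle_for >= cooldown else Action.IDLE
    if role == "guesser":
        if state.last_stroke_time is None:
            return Action.IDLE  # nothing on the canvas to interpret yet
        no_guess_yet = state.last_guess_time is None
        # Guess only if the canvas changed since the last guess...
        new_info = no_guess_yet or state.last_stroke_time > state.last_guess_time
        # ...and enough time has passed since that guess.
        waited = no_guess_yet or state.now - state.last_guess_time >= cooldown
        return Action.GUESS if new_info and waited else Action.IDLE
    raise ValueError(f"unknown role: {role!r}")
```

Calling `decide` for both roles on every tick lets the drawer and the guesser act at the same time or not at all, which is the property that distinguishes this setup from strict turn-taking.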

DRAWBOT: The Sketcher

DRAWBOT visually communicates the target word through asynchronous sketching, leveraging state-of-the-art generative models fine-tuned for iterative refinement based on communication context.

  • Generates sketches from target concepts, adapting to canvas state.
  • Iteratively refines drawings based on guesses and feedback signals.
  • Capable of adapting to iconic feedback (👍, 👎, ❓).
  • Operates asynchronously, deciding when to draw or stay idle.

GUESSBOT: The Guesser

GUESSBOT interprets evolving sketches and communication cues to make intelligent guesses, using a retrieval-based framework informed by historical interaction data.

  • Incorporates the current sketch canvas content using vision models.
  • Generates relevant textual guesses using efficient retrieval and filtering.
  • Acts asynchronously, deciding when new information warrants a guess.
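
The retrieve-then-filter pattern described above can be sketched as follows. This is a toy example under stated assumptions: the sketch and candidate words are assumed to be already embedded as plain float vectors, and `cosine`, `retrieve_guesses`, and the `rejected` set are hypothetical names rather than parts of GUESSBOT.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_guesses(
    sketch_vec: Sequence[float],
    word_vecs: Dict[str, Sequence[float]],
    rejected: Set[str],
    k: int = 3,
) -> List[str]:
    """Return the k candidate words most similar to the sketch embedding,
    skipping words that were already guessed and rejected."""
    scored = [
        (cosine(sketch_vec, vec), word)
        for word, vec in word_vecs.items()
        if word not in rejected  # filtering step
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # retrieval step
    return [word for _, word in scored[:k]]


# Toy 2-D embeddings, purely for illustration.
vocab = {"cat": [1.0, 0.0], "dog": [0.8, 0.6], "car": [0.0, 1.0]}
print(retrieve_guesses([1.0, 0.05], vocab, rejected=set(), k=2))
print(retrieve_guesses([1.0, 0.05], vocab, rejected={"cat"}, k=2))
```

Filtering out rejected words is one simple way a 👎 on a previous guess could influence the next one; the actual mechanism used by GUESSBOT may differ.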

Evaluating Agent Performance

AAO

Asynchronous Action Overlap

Measures concurrent actions between agents. AAO values close to those observed in human sessions suggest more natural, human-like interaction dynamics.
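
This page does not give the formula for AAO, so the snippet below is only a stand-in that illustrates the general notion of concurrent activity. It assumes each agent's actions are recorded as `(start, end)` time intervals that do not overlap within a single agent, and it reports the share of active time during which both agents act at once. The function name `action_overlap` and this ratio are assumptions, not the paper's definition.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def action_overlap(a: List[Interval], b: List[Interval]) -> float:
    """Fraction of active time in which both agents act concurrently.

    Assumes the intervals within each list do not overlap one another.
    """
    # Time during which an interval from `a` intersects one from `b`.
    both = sum(
        max(0.0, min(end_a, end_b) - max(start_a, start_b))
        for start_a, end_a in a
        for start_b, end_b in b
    )
    total_a = sum(end - start for start, end in a)
    total_b = sum(end - start for start, end in b)
    either = total_a + total_b - both  # time in which at least one agent acts
    return both / either if either > 0 else 0.0


# Drawer active 0-4 s, guesser active 2-6 s: 2 s shared out of 6 s active.
print(action_overlap([(0.0, 4.0)], [(2.0, 6.0)]))
```

Under this stand-in, a strictly turn-taking pair scores 0, so comparing the value for agent sessions against human sessions gives a rough sense of how human-like the concurrency is.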

FRS

Feedback Responsiveness Score

Quantifies how effectively agents adapt to feedback (👍👎) and move toward the goal.

MATS

Multimodal Action Timing Similarity

Compares agent action timing patterns with human interactions to assess the naturalness of pacing.
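
As with AAO, the exact MATS formula is not stated on this page. One plausible way to compare pacing, shown here purely as an assumption, is to histogram the gaps between consecutive actions for an agent and for a human and take the histogram intersection. The names `gap_histogram` and `timing_similarity`, the bin width, and the bin count are all hypothetical.

```python
from typing import List, Sequence


def gap_histogram(timestamps: Sequence[float], bin_width: float, n_bins: int) -> List[float]:
    """Normalized histogram of gaps between consecutive action timestamps."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    counts = [0] * n_bins
    for gap in gaps:
        # Gaps beyond the last bin edge are clamped into the final bin.
        index = min(int(gap // bin_width), n_bins - 1)
        counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


def timing_similarity(
    agent_times: Sequence[float],
    human_times: Sequence[float],
    bin_width: float = 1.0,
    n_bins: int = 10,
) -> float:
    """Histogram intersection of inter-action gaps, in the range [0, 1]."""
    p = gap_histogram(agent_times, bin_width, n_bins)
    q = gap_histogram(human_times, bin_width, n_bins)
    return sum(min(pi, qi) for pi, qi in zip(p, q))


# Agent gaps: 1, 1, 5 s. Human gaps: 1, 5 s.
print(timing_similarity([0, 1, 2, 7], [0, 1, 6]))
```

A score of 1 means the two gap distributions match bin for bin, and 0 means they share no bins. Other distribution distances would serve the same illustrative purpose.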

Example Sessions

Authors

Mohd Hozaifa Khan

IIIT Hyderabad

Ravi Kiran Sarvadevabhatla

IIIT Hyderabad

Resources


Interactive Demo - Coming Soon!

Stay tuned for a live demo where you can experience Sketchtopia agents interacting.

In the meantime, explore the Dataset


Citation

@inproceedings{khan2025sketchtopia,
  author    = {Mohd Hozaifa Khan and Ravi Kiran Sarvadevabhatla},
  title     = {Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
  url       = {https://sketchtopia25.github.io/}
}