Talk to a Statue: Building A Multi-Modal ElevenAgents-Powered App

Written by: Joe Reeve
Published: Feb 18, 2026
Last updated: Jun 29, 2026

Photograph a statue. Identify the figures depicted. Then have a real-time voice conversation with them - each character speaking in a distinct, period-appropriate voice.

That is what you can build with ElevenLabs' Voice Design and Agent APIs. In this post, we walk through the architecture of a mobile web app that combines computer vision with voice generation to turn public monuments into interactive experiences. Everything here is replicable with the APIs and code samples below.

Skip the tutorial - build it in one prompt

The entire app below was built from a single prompt. We tested it in Cursor with Claude Opus 4.5 (high), where it produced a working app in one shot from an empty Next.js project. If you want to skip ahead and build your own, paste this into your editor:

We need to make an app that:
- is optimised for mobile
- allows the user to take a picture (of a statue, picture, monument, etc) that includes one or more people
- uses an OpenAI LLM api call to identify the statue/monument/picture, characters within it, the location, and name
- allows the user to check it's correct, and then do either a deep research or a standard search to get information about the characters and the statue's history, and its current location
- then create an ElevenLabs agent (allowing multiple voices), that the user can then talk to as though they're talking to the characters in the statue. Each character should use voice designer api to create a matching voice.
The purpose is to be fun and educational.

https://elevenlabscreator.arsenaldigitalweb.com.br/docs/eleven-api/guides/how-to/voices/voice-design
https://elevenlabscreator.arsenaldigitalweb.com.br/docs/eleven-agents/quickstart 
https://elevenlabscreator.arsenaldigitalweb.com.br/docs/api-reference/agents/create

You can also use the ElevenLabs Agent Skills instead of linking to the docs. These are based on the docs and can yield even better results.

The rest of this post breaks down what that prompt produces.

How it works

The pipeline has five stages:

1. Capture an image
2. Identify the artwork and its characters (OpenAI)
3. Research the history (OpenAI)
4. Generate unique voices for each character (ElevenAPI)
5. Start a real-time voice conversation over WebRTC (ElevenAgents)
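
The five stages above can be read as a simple linear state machine. The following is a minimal sketch rather than the app's actual code: the stage names and the `nextStage` helper are hypothetical, chosen only to illustrate how a client might track progress through the pipeline.

```typescript
// Hypothetical stage names mirroring the five pipeline steps listed above.
const STAGES = ["capture", "identify", "research", "voices", "conversation"] as const;

type Stage = (typeof STAGES)[number];

// Returns the stage that follows `current`, or null once the conversation has started.
function nextStage(current: Stage): Stage | null {
  return STAGES[STAGES.indexOf(current) + 1] ?? null;
}
```

A UI could use `nextStage` to decide which screen to show after each step completes, for example moving from "identify" to "research" once the user confirms the identification.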

Identifying the statue with vision

When a user photographs a statue, the image is sent to an OpenAI vision-capable model. A structured system prompt extracts the artwork name, location, artist, date, and - critically - a detailed voice description for each character. The system prompt includes the expected JSON output format:

{
  "statueName": "string - name of the statue, monument, or artwork",
  "location": "string - where it is located (city, country)",
  "artist": "string - the creator of the artwork",
  "year": "string - year completed or unveiled",
  "description": "string - brief description of the artwork and its historical significance",
  "characters": [
    {
      "name": "string - character name",
      "description": "string - who this person was and their historical significance",
      "era": "string - time period they lived in",
      "voiceDescription": "string - detailed voice description for Voice Design API (include audio quality marker, age, gender, vocal qualities, accent, pacing, and personality)"
    }
  ]
}

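import OpenAI from "openai";

// Reads OPENAI_API_KEY from the environment by default.
const openai = new OpenAI();

// Send the photo (as a base64 data URL) together with the structured system prompt.
const response = await openai.chat.completions.create({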
  model: "gpt-5.2",
  response_format: { type: "json_object" },
  messages: [
    { role: "system", content: SYSTEM_PROMPT },
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "Identify this statue/monument/artwork and all characters depicted.",
        },
        {
          type: "image_url",
          image_url: {
            url: `data:image/jpeg;base64,${base64Data}`,
            detail: "high",
          },
        },
      ],
    },
  ],
  max_completion_tokens: 2500,
});

For a photograph of the Boudica statue on Westminster Bridge, London, the response looks like this:

{
  "statueName": "Boudica and Her Daughters",
  "location": "Westminster Bridge, London, UK",
  "artist": "Thomas Thornycroft",
  "year": "1902",
  "description": "Bronze statue depicting Queen Boudica riding a war chariot with her two daughters, commemorating her uprising against Roman occupation of Britain.",
  "characters": [
    {
      "name": "Boudica",
      "description": "Queen of the Iceni tribe who led an uprising against Roman occupation",
      "era": "Ancient Britain, 60-61 AD",
      "voiceDescription": "Perfect audio quality. A powerful woman in her 30s with a deep, resonant voice and a thick Celtic British accent. Her tone is commanding and fierce, with a booming quality that projects authority. She speaks at a measured, deliberate pace with passionate intensity."
    },
    // Other characters in the statue
  ]
}
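
Because the model returns free-form JSON, it is worth validating the response before passing it to later stages. The sketch below assumes the field names from the output format shown earlier; the `parseIdentification` and `hasStrings` helpers are hypothetical and not part of any SDK.

```typescript
// Shapes mirror the JSON output format given in the system prompt.
interface StatueCharacter {
  name: string;
  description: string;
  era: string;
  voiceDescription: string;
}

interface StatueIdentification {
  statueName: string;
  location: string;
  artist: string;
  year: string;
  description: string;
  characters: StatueCharacter[];
}

// Hypothetical helper: true when every listed key holds a non-empty string.
function hasStrings(value: unknown, keys: string[]): boolean {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return keys.every((key) => typeof record[key] === "string" && record[key] !== "");
}

// Parses the model's raw JSON text and throws if it does not match the expected shape.
function parseIdentification(raw: string): StatueIdentification {
  const data: unknown = JSON.parse(raw);
  if (!hasStrings(data, ["statueName", "location", "artist", "year", "description"])) {
    throw new Error("Identification is missing a top-level field");
  }
  const characters = (data as { characters?: unknown }).characters;
  if (!Array.isArray(characters) || characters.length === 0) {
    throw new Error("Identification must include at least one character");
  }
  for (const character of characters) {
    if (!hasStrings(character, ["name", "description", "era", "voiceDescription"])) {
      throw new Error("Character is missing a required field");
    }
  }
  return data as StatueIdentification;
}
```

Failing early here means a malformed response surfaces as a clear error on the identification screen instead of as a broken voice or agent later in the pipeline.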

Writing effective voice descriptions

The quality of the voice description directly determines the quality of the generated voice. The Voice Design prompting guide covers this in detail, but the key attributes to include are: an audio quality marker ("Perfect audio quality."), age and gender, tone/timbre (deep, resonant, gravelly), a precise accent ("thick Celtic British accent" rather than just "British"), and pacing. More descriptive prompts yield more accurate results - "a tired New Yorker in her 60s with a dry sense of humor" will outperform "an older female voice" every time.

A few things worth noting from the guide: use "thick" rather than "strong" when describing accent prominence, avoid vague terms like "foreign," and for fictional or historical characters you can suggest real-world accents as inspiration (e.g., "an ancient Celtic queen with a thick British accent, regal and commanding").
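
These tips can be turned into a quick pre-flight check on the descriptions the vision model returns. The following is only an illustrative sketch: `lintVoiceDescription` is a hypothetical helper, and its checks are assumptions derived from the guidance above, not rules enforced by the Voice Design API.

```typescript
// Returns a list of warnings for a voice description, based on the tips above.
// The specific checks are illustrative assumptions, not API validation rules.
function lintVoiceDescription(description: string): string[] {
  const warnings: string[] = [];
  const text = description.toLowerCase();

  if (!text.includes("audio quality")) {
    warnings.push('Add an audio quality marker such as "Perfect audio quality."');
  }
  // Flags "strong ... accent" within a single sentence.
  if (/\bstrong\b[^.]*\baccent\b/.test(text)) {
    warnings.push('Prefer "thick" over "strong" for accent prominence.');
  }
  if (/\bforeign\b/.test(text)) {
    warnings.push('Replace the vague term "foreign" with a specific accent.');
  }
  if (!text.includes("accent")) {
    warnings.push("Name a precise accent.");
  }
  return warnings;
}
```

If the list is non-empty, the app could ask the model to regenerate the description before spending a Voice Design call on it.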

Creating character voices with Voice Design

The Voice Design API generates new synthetic voices from text descriptions - no voice samples or cloning required. This makes it well-suited for historical figures where source audio does not exist.

The process has two steps.

Generate previews

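import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
// Reads ELEVENLABS_API_KEY from the environment by default.
const elevenlabs = new ElevenLabsClient();

const { previews } = await elevenlabs.textToVoice.design({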
  modelId: "eleven_multilingual_ttv_v2",
  voiceDescription: character.voiceDescription,
  text: sampleText,
});

The text parameter matters. Longer, character-appropriate sample text (50+ words) produces more stable results - match the dialogue to the character rather than using a generic greeting. The Voice Design prompting guide covers this in more detail.
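
A small guard can catch sample text that is too short before it reaches the API. This is a minimal sketch: `countWords` and `isSampleTextLongEnough` are hypothetical helpers, and the 50-word floor is simply the rule of thumb from the paragraph above, not a documented API limit.

```typescript
// Counts whitespace-separated words in a sample line.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Rule of thumb from the guidance above; an assumption, not an API constraint.
const MIN_SAMPLE_WORDS = 50;

// True when the sample text meets the suggested minimum length.
function isSampleTextLongEnough(text: string): boolean {
  return countWords(text) >= MIN_SAMPLE_WORDS;
}
```

When the check fails, one option is to ask the LLM for a longer in-character monologue rather than padding the text with filler.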

Save the voice

Once previews are generated, select one and create a permanent voice:

const voice = await elevenlabs.textToVoice.create({
  voiceName: `StatueScanner - ${character.name}`,
  voiceDescription: character.voiceDescription,
  generatedVoiceId: previews[0].generatedVoiceId,
});

For multi-character statues, voice creation runs in parallel. Five characters' voices generate in roughly the same time as one:

const results = await Promise.all(
  characters.map((character) => createVoiceForCharacter(character))
);
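
One caveat: `Promise.all` rejects as soon as any single voice fails, discarding the voices that did succeed. A possible alternative, sketched below under the assumption that partial results are acceptable, is `Promise.allSettled`. The `createVoicesSettled` function is hypothetical, and its `createVoice` parameter stands in for the `createVoiceForCharacter` function used above.

```typescript
interface VoiceResult {
  name: string;
  voiceId: string;
}

// Runs voice creation for every character in parallel and separates successes
// from failures. `createVoice` is an injected stand-in for the real API call.
async function createVoicesSettled(
  names: string[],
  createVoice: (name: string) => Promise<string>,
): Promise<{ created: VoiceResult[]; failed: string[] }> {
  const settled = await Promise.allSettled(names.map((name) => createVoice(name)));
  const created: VoiceResult[] = [];
  const failed: string[] = [];
  settled.forEach((result, index) => {
    const name = names[index] ?? "";
    if (result.status === "fulfilled") {
      created.push({ name, voiceId: result.value });
    } else {
      failed.push(name);
    }
  });
  return { created, failed };
}
```

The app can then retry only the names in `failed`, or start the conversation with the characters whose voices were created.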

Building a multi-voice ElevenLabs Agent

With voices created, the next step is configuring an ElevenLabs Agent that can switch between character voices in real time.

const agent = await elevenlabs.conversationalAi.agents.create({
  name: `Statue Scanner - ${statueName}`,
  tags: ["statue-scanner"],
  conversationConfig: {
    agent: {
      firstMessage,
      language: "en",
      prompt: {
        prompt: systemPrompt,
        temperature: 0.7,
      },
    },
    tts: {
      voiceId: primaryCharacter.voiceId,
      modelId: "eleven_v3",
      supportedVoices: otherCharacters.map((c) => ({
        voiceId: c.voiceId,
        label: c.name,
        description: c.voiceDescription,
      })),
    },
    turn: {
      turnTimeout: 10,
    },
    conversation: {
      maxDurationSeconds: 600,
    },
  },
});

Multi-voice switching

The supportedVoices array tells the agent which voices are available. The Agents platform handles voice switching automatically - when the LLM's response indicates a different character is speaking, the TTS engine routes that segment to the correct voice.
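
The agent configuration above refers to `primaryCharacter` and `otherCharacters` without showing how they are chosen. One plausible approach, sketched here as an assumption, is to treat the first identified character as the default voice and the rest as supported voices. `buildTtsVoices` is a hypothetical helper; the returned field names match the `tts` block in the call above.

```typescript
interface CharacterVoice {
  name: string;
  voiceId: string;
  voiceDescription: string;
}

// Splits characters into the agent's default voice and the extra voices it may switch to.
function buildTtsVoices(characters: CharacterVoice[]) {
  const [primary, ...others] = characters;
  if (!primary) {
    throw new Error("At least one character voice is required");
  }
  return {
    voiceId: primary.voiceId,
    supportedVoices: others.map((c) => ({
      voiceId: c.voiceId,
      label: c.name,
      description: c.voiceDescription,
    })),
  };
}
```

The result can be spread into the `tts` object alongside `modelId`. For a single-character statue, `supportedVoices` is simply an empty array.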

Prompt engineering for group conversations

Making multiple characters feel like a real group - rather than a sequential Q&A - requires deliberate prompt design:

const multiCharacterRules = `
MULTI-CHARACTER DYNAMICS:
You are playing ALL ${characters.length} characters simultaneously.
Make this feel like a group conversation, not an interview.

- Characters should interrupt each other:
  "Actually, if I may -" / "Wait, I must say -"

- React to what others say:
  "Well said." / "I disagree with that..." / "Always so modest..."

- Have side conversations:
  "Do you remember when -" / "Tell them about the time you -"

The goal is for users to feel like they are witnessing a real exchange
between people who happen to include them.
`;
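
These rules only make sense when there is more than one character. The sketch below shows one way the full system prompt might be assembled, appending the group rules conditionally. `buildSystemPrompt` and its prompt wording are hypothetical and are not the prompt used in the original app.

```typescript
interface PromptCharacter {
  name: string;
  description: string;
  era: string;
}

// Hypothetical prompt builder: lists each character, then appends the group
// rules only when there is more than one character to play.
function buildSystemPrompt(
  statueName: string,
  characters: PromptCharacter[],
  multiCharacterRules: string,
): string {
  const roster = characters
    .map((c) => `- ${c.name} (${c.era}): ${c.description}`)
    .join("\n");
  const sections = [
    `You are voicing the figures of "${statueName}".`,
    `CHARACTERS:\n${roster}`,
  ];
  if (characters.length > 1) {
    sections.push(multiCharacterRules.trim());
  }
  return sections.join("\n\n");
}
```

Keeping the rules in a separate string makes it easy to iterate on the group dynamics without touching the per-character roster.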

Real-time voice over WebRTC

The final piece is the client connection. ElevenLabs Agents support WebRTC for low-latency voice conversations - noticeably faster than WebSocket-based connections, which matters for natural turn-taking.

Server-side: get a conversation token

const { token } = await elevenlabs.conversationalAi.conversations.getWebrtcToken({
  agentId,
});

Client-side: start the session

import { useConversation } from "@elevenlabs/react";

const conversation = useConversation({
  onConnect: () => setIsSessionActive(true),
  onDisconnect: () => setIsSessionActive(false),
  onMessage: (message) => {
    if (message.source === "ai") {
      setMessages((prev) => [...prev, { role: "agent", text: message.message }]);
    }
  },
});

await conversation.startSession({
  agentId,
  conversationToken: token,
  connectionType: "webrtc",
});

The useConversation hook handles audio capture, streaming, voice activity detection, and playback.

Adding research depth with web search

For users who want more historical context before starting a conversation, you can add an enhanced research mode using OpenAI's web search tool:

const response = await openai.responses.create({
  model: "gpt-5.2",
  instructions: RESEARCH_SYSTEM_PROMPT,
  tools: [{ type: "web_search_preview" }],
  input: `Research ${identification.statueName}. Search for current information
including location, visiting hours, and recent news about the artwork.`,
});

What we learned

This project shows that combining AI modalities (text, research, vision, and audio) makes it possible to build experiences that bridge the digital and physical worlds. Multi-modal agents have a lot of untapped potential, and we'd love to see more people explore them for education, work, and fun.

Start building

The APIs used in this project - Voice Design, ElevenAgents, and OpenAI - are all available today.
