IP-Prompter: Training-Free Theme-Specific Image Generation via Dynamic Visual Prompting

¹MAIS, Institute of Automation, CAS
²School of Artificial Intelligence, UCAS
³ByteDance Inc.
⁴University of Konstanz
⁵National Cheng-Kung University
*Corresponding Author
ACM SIGGRAPH 2025

Abstract

The stories and characters that captivate us as we grow up shape unique fantasy worlds, with images serving as the primary medium for visually experiencing these realms. Personalizing generative models through fine-tuning with theme-specific data has become a prevalent approach in text-to-image generation. However, unlike object customization, which focuses on learning specific objects, theme-specific generation encompasses diverse elements such as characters, scenes, and objects. Such diversity also introduces a key challenge: how to adaptively generate multi-character, multi-concept, and continuous theme-specific images (TSI). Moreover, fine-tuning approaches often come with significant computational overhead, time costs, and risks of overfitting. This paper explores a fundamental question: Can image generation models directly leverage images as contextual input, much as large language models use text as context? To address this, we present IP-Prompter, a novel training-free method for TSI generation. IP-Prompter introduces visual prompting, a mechanism that integrates reference images into generative models, allowing users to seamlessly specify the target theme without requiring additional training. To further enhance this process, we propose a Dynamic Visual Prompting (DVP) mechanism, which iteratively optimizes visual prompts to improve the accuracy and quality of generated images. Our approach enables diverse applications, including consistent story generation, character design, realistic character generation, and style-guided image generation. Comparative evaluations against state-of-the-art personalization methods demonstrate that IP-Prompter achieves significantly better results and excels at preserving character identity, style consistency, and text alignment, offering a robust and flexible solution for theme-specific image generation.

Visual Prompting

Pipeline of IP-Prompter. Dynamic Visual Prompting (DVP) includes three key stages: (1) comprehending user intent and extracting key elements; (2) matching and generating visual prompts; and (3) updating and evaluating prompts through self-consistency. As a result, (4) DVP enables effortless transitions between diverse creative subjects, enhancing the flexibility and efficiency of content generation.

Theme-specific images (TSIs) are visuals that combine characters, objects, and environments under a unified style or story. They are essential for branding, storytelling, and design.

Although text-to-image models have advanced, generating TSIs is harder than customizing a single object because it requires managing multiple elements, such as characters and backgrounds, at once. Existing methods (fine-tuning, control networks, attention exchange) are costly in time or data and may break character consistency.

Inspired by large language models that flexibly use context, we bring the idea of text prompt engineering into image generation. We introduce a visual prompting framework that arranges guiding images into an image grid and uses that grid directly as the input, with no extra training or model changes needed. This keeps guidance accurate, efficient, and visually consistent.
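To make the image-grid idea concrete, the following minimal sketch computes where each guiding image would sit on a shared canvas. It is an illustration under stated assumptions, not the authors' implementation: the near-square layout, the reserved empty "target" cell, and the names grid_boxes and visual_prompt_layout are all hypothetical and do not come from the paper.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def grid_boxes(n_cells: int, cell_size: int = 512) -> Tuple[Tuple[int, int], List[Box]]:
    """Return the canvas size and one pixel box per cell of a near-square grid."""
    if n_cells < 1:
        raise ValueError("need at least one cell")
    cols = math.ceil(math.sqrt(n_cells))
    rows = math.ceil(n_cells / cols)
    boxes: List[Box] = []
    for i in range(n_cells):
        r, c = divmod(i, cols)  # fill the grid row by row
        boxes.append((c * cell_size, r * cell_size,
                      (c + 1) * cell_size, (r + 1) * cell_size))
    return (cols * cell_size, rows * cell_size), boxes


def visual_prompt_layout(n_refs: int, cell_size: int = 512) -> Dict[str, object]:
    """Lay out n_refs guiding images plus one empty cell.

    Assumption (not from the source): the last cell is kept empty as the
    region where the new image would be produced.
    """
    canvas, boxes = grid_boxes(n_refs + 1, cell_size)
    return {
        "canvas": canvas,
        "reference_boxes": boxes[:-1],
        "target_box": boxes[-1],
    }
```

With three guiding images and 512-pixel cells, this layout yields a 1024x1024 canvas whose bottom-right cell is the reserved one. Pasting actual images into these boxes could be done with any imaging library.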

To improve flexibility, we propose Dynamic Visual Prompting (DVP):

It analyzes user text, matches visual elements, self-updates the grid arrangements, and evaluates them automatically.
It can switch between characters and themes seamlessly based on user instructions.
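The control flow described above (analyze the text, match visual elements, update the grid arrangement, evaluate) can be sketched as a small loop. This is a toy sketch only: keyword matching stands in for user-intent analysis, a caller-supplied score function stands in for the self-consistency evaluation, and every name here (extract_elements, match_references, dynamic_visual_prompt, the file names) is hypothetical rather than taken from IP-Prompter.

```python
from itertools import permutations
from typing import Callable, Dict, List, Sequence, Tuple


def extract_elements(prompt: str, known: Sequence[str]) -> List[str]:
    """Stage 1 stand-in: keep the known theme elements mentioned in the prompt."""
    text = prompt.lower()
    return [name for name in known if name.lower() in text]


def match_references(elements: Sequence[str],
                     library: Dict[str, List[str]],
                     per_element: int = 2) -> List[str]:
    """Stage 2 stand-in: take up to per_element reference images per element."""
    refs: List[str] = []
    for element in elements:
        refs.extend(library.get(element, [])[:per_element])
    return refs


def dynamic_visual_prompt(prompt: str,
                          library: Dict[str, List[str]],
                          score: Callable[[Tuple[str, ...], str], float],
                          max_candidates: int = 6) -> Tuple[List[str], List[str]]:
    """Stage 3 stand-in: try several grid orderings and keep the best-scoring one.

    `score` is a placeholder for whatever evaluation the real system uses.
    """
    elements = extract_elements(prompt, list(library))
    refs = match_references(elements, library)
    if not refs:
        return elements, []
    best, best_score = tuple(refs), float("-inf")
    for i, arrangement in enumerate(permutations(refs)):
        if i >= max_candidates:
            break
        s = score(arrangement, prompt)
        if s > best_score:  # strict comparison keeps the earliest best candidate
            best, best_score = arrangement, s
    return elements, list(best)
```

Switching characters or themes then amounts to calling dynamic_visual_prompt with a different prompt; only the elements named in the text pull their reference images into the grid.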

Our full system, called IP-Prompter, achieves state-of-the-art TSI generation:

Training-free, modification-free, and highly flexible.
Supports creative tasks like story generation, character design, and style-guided creation.
Generates strong results with 15 diverse reference images per character and excels with 30.
The matching and evaluation processes take about 10 seconds. The synthesis process takes about 30 seconds for a 512x512 image, which is comparable to baseline training-free methods.

Dynamic Visual Prompting