BootComp: Breaking Boundaries in AI Model Generation with Multi-Garment Control

At OMNIOUS.AI, we are excited to unveil our latest innovation in the realm of virtual try-on technology: BootComp. Building on the foundation laid by our previous model, Vella 1.0, BootComp represents a significant leap forward in controllable human image generation, particularly in the fashion domain. We are thrilled to announce that our paper on BootComp has been accepted to CVPR 2025, one of the most prestigious conferences in computer vision. This blog post will delve into the capabilities of BootComp, its underlying technology, and the potential applications that can transform the way consumers interact with fashion.

What is BootComp?

BootComp is a novel framework designed for controllable human image generation using multiple reference garments. By leveraging advanced text-to-image (T2I) diffusion models, BootComp allows users to generate high-quality images of humans wearing various garments, all while preserving intricate details and ensuring realistic representations. This capability opens up new avenues for personalized fashion experiences, enabling users to visualize how different outfits would look on them without the need for physical try-ons.

Key Features of BootComp

  • Synthetic Data Generation: One of the challenges in training models for fashion applications is the acquisition of high-quality paired datasets. BootComp addresses this by employing a sophisticated data generation pipeline that creates a large synthetic dataset of human images and multiple garment images. This approach not only enhances the model's training efficiency but also improves the overall quality of generated images.

Figure 1. Examples of our synthetic paired data. We visualize our synthetic pairs of a human image and multiple garment images. Our decomposition module generates high-quality garment images in product view on different categories including shirts, pants, shoes and bags.

  • Multi-Garment Generation: Unlike traditional models that may struggle with generating images of humans wearing multiple garments, BootComp excels in this area. It can accurately depict complex combinations of clothing items, from casual wear to formal attire, ensuring that each garment retains its unique characteristics.

Figure 2. BootComp can generate human images wearing multiple reference garments. Left panels show reference garment images, and right panels show the generated human images wearing those garments.

  • Versatile Applications: BootComp is not limited to virtual try-on scenarios. Its capabilities extend to various applications, including outfit recommendations, fashion model generation for brands, and even personalized styling based on user preferences. The model can adapt to different conditions, such as pose and style, making it a versatile tool for both consumers and fashion professionals.

Figure 3. Versatility of BootComp: the framework generalizes easily to diverse tasks.

How BootComp Works

BootComp operates through a two-stage framework:

  1. Decomposition Module: The first stage involves a decomposition module that extracts reference garment images from human images. This process generates a synthetic dataset that pairs human images with multiple garment images, significantly enhancing the model's training data.

Figure 4. Decomposition Module

  2. Composition Module: In the second stage, BootComp fine-tunes a T2I diffusion model as a composition module. Trained on the synthetic paired dataset, this module generates human images conditioned on multiple reference garment images, producing realistic results that faithfully reflect the details of each garment (a minimal data-loading sketch for this stage follows Figure 5).

Figure 5. Composition Module
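To make the two-stage setup concrete, here is a minimal sketch (not BootComp's released code) of how the synthetic human-garment pairs produced by the decomposition module might be organized and loaded when training the composition module. The directory layout, garment categories, and field names below are illustrative assumptions.

```python
# Minimal sketch: loading synthetic human-garment pairs for composition training.
# Directory layout and category names are assumptions, not BootComp's actual format:
#   root/humans/<id>.jpg
#   root/garments/<id>/<category>.jpg   (one product-view image per category)
from pathlib import Path

import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

GARMENT_CATEGORIES = ["upper", "lower", "shoes", "bag"]  # assumed categories

class SyntheticPairDataset(Dataset):
    """Pairs a human image with the product-view garments extracted from it."""

    def __init__(self, root: str, size: int = 512):
        self.root = Path(root)
        self.size = size
        self.ids = sorted(p.stem for p in (self.root / "humans").glob("*.jpg"))
        self.tf = transforms.Compose([
            transforms.Resize((size, size)),
            transforms.ToTensor(),
            transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),  # scale to [-1, 1]
        ])

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        sample_id = self.ids[idx]
        human = self.tf(Image.open(self.root / "humans" / f"{sample_id}.jpg").convert("RGB"))

        garments, mask = [], []
        for cat in GARMENT_CATEGORIES:
            path = self.root / "garments" / sample_id / f"{cat}.jpg"
            if path.exists():
                garments.append(self.tf(Image.open(path).convert("RGB")))
                mask.append(1.0)
            else:  # category absent for this person; pad so batches still stack
                garments.append(torch.zeros(3, self.size, self.size))
                mask.append(0.0)

        return {
            "human": human,                      # target image for the composition module
            "garments": torch.stack(garments),   # (num_categories, 3, H, W) conditioning images
            "garment_mask": torch.tensor(mask),  # which categories are present
        }
```

Each batch then provides the target human image plus a stack of per-category conditioning garments, with a mask indicating which categories are actually present for that person.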

Performance and Results

Evaluation Methodology

To rigorously evaluate BootComp's performance, we manually collected a comprehensive evaluation dataset, as there are no common benchmarks for assessing controllable human image generation. Our evaluation framework includes:

  1. Garment Dataset: We curated 5,000 garment image sets across three representative categories (upper garments, lower garments, and shoes). Upper and lower garment images were randomly selected from the test dataset of DressCode, while shoe images came from a public dataset.
  2. Reference Human Images: For FID (Fréchet Inception Distance) evaluation, we gathered 30,000 human images featuring various garments in different poses from the test datasets of DressCode, VITON-HD, and DeepFashion to serve as reference image sets.
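For reference, FID over such reference and generated image sets can be computed with off-the-shelf tooling. The sketch below uses the torchmetrics implementation, with data loading left as placeholders; it is not the exact evaluation script used in the paper.

```python
# Sketch of the FID protocol described above, using torchmetrics.
# real_loader / fake_loader are assumed to yield float image batches of shape
# (B, 3, H, W) with values in [0, 1]; their construction is omitted here.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def compute_fid(real_loader, fake_loader, device="cuda"):
    fid = FrechetInceptionDistance(feature=2048, normalize=True).to(device)

    for real_batch in real_loader:   # reference human images (DressCode, VITON-HD, DeepFashion)
        fid.update(real_batch.to(device), real=True)

    for fake_batch in fake_loader:   # human images generated from the 5,000 garment sets
        fid.update(fake_batch.to(device), real=False)

    return fid.compute().item()      # lower is better
```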

Qualitative Results

Figure 6. Qualitative comparisons of BootComp with other baseline methods. BootComp generates more realistic human images in various poses while faithfully preserving details of reference garment images. Other methods often generate human images with garments inconsistent with the references.

BootComp demonstrates superior performance in generating creative combinations of garments. For instance, it can successfully generate human images with uncommon garment combinations (e.g., trousers with soccer cleats) where baseline methods like Parts2Whole or MIP-Adapter either undesirably replace certain garments or struggle with generating high-fidelity results.

Quantitative Results

Our quantitative evaluation confirms BootComp's superior performance across all four evaluation metrics compared to baseline methods:

  1. MP-LPIPS Score: BootComp achieves a 30% improvement over the baselines, demonstrating its effectiveness in preserving garment details.
  2. FID Value: BootComp attains lower FID values than existing methods, indicating that it generates more authentic and realistic human images.
  3. DINO and M-DINO Metrics: These assessments further validate BootComp's ability to maintain semantic consistency between reference garments and the generated human images.
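As a rough illustration of the DINO-based similarity, the sketch below scores garment preservation as the cosine similarity between DINO ViT-S/16 features of a reference garment image and of the corresponding garment region cropped from the generated human image. The exact cropping/masking and the multi-garment aggregation used for M-DINO follow the paper and are not reproduced here.

```python
# Rough sketch of a DINO-style garment similarity score: cosine similarity
# between DINO ViT-S/16 features of a reference garment image and of the
# corresponding garment crop from the generated human image.
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("facebook/dino-vits16")
model = ViTModel.from_pretrained("facebook/dino-vits16").eval()

@torch.no_grad()
def dino_embedding(image: Image.Image) -> torch.Tensor:
    inputs = processor(images=image, return_tensors="pt")
    cls_token = model(**inputs).last_hidden_state[:, 0]  # (1, hidden_dim) CLS feature
    return torch.nn.functional.normalize(cls_token, dim=-1)

def dino_similarity(reference_garment: Image.Image, generated_garment_crop: Image.Image) -> float:
    ref = dino_embedding(reference_garment)
    gen = dino_embedding(generated_garment_crop)
    return float((ref * gen).sum())  # cosine similarity; higher means better garment preservation
```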

These performance metrics are crucial for applications in the fashion industry, where visual accuracy significantly impacts consumer decisions and brand perception.

Figure 7. Quantitative comparisons. We compare BootComp with baselines on garment similarity and image fidelity. BootComp outperforms other methods, preserving fine details of garments while naturally generating human images.

Future Directions

As we continue to refine BootComp, we envision its integration into various platforms, enhancing the online shopping experience for consumers. By providing a realistic and interactive way to try on clothes virtually, BootComp can help reduce return rates and increase customer satisfaction. We have also demonstrated the wide applicability of our framework by adapting it to other types of reference-based generation in the fashion domain, including virtual try-on and controllable human image generation with additional conditions such as pose and face.

Conclusion

BootComp marks a new chapter in the evolution of virtual try-on technology. With its advanced capabilities and potential applications, we are excited to see how it will transform the fashion industry. At OMNIOUS.AI, we remain committed to pushing the boundaries of technology to create innovative solutions that empower users and enhance their experiences.

Stay tuned for more updates as we continue to develop BootComp and explore its applications in the fashion domain!

Resources

Hugging Face Model
Project Page
GitHub Repository
Paper on arXiv
LinkedIn Post