Abstract: Advanced diffusion models have made notable progress in text-to-image compositional generation. However, existing models still struggle to achieve text-image alignment when ...
Abstract: Current text-only image captioning methods leverage the shared feature space of CLIP to train zero-shot image captioning models using only text data, leaving feature associations and contextual ...