Cohere For AI, AI startup Cohere’s nonprofit research lab, recently unveiled an “open” multimodal AI model called Aya Vision. The model can write image captions, answer questions about photos, translate text, and generate summaries in 23 major languages. Cohere has also made Aya Vision available for free through WhatsApp, a step toward making technical breakthroughs more widely available to researchers worldwide.
Aya Vision comes in two versions: Aya Vision 32B and Aya Vision 8B. Both have performed well, with Aya Vision 32B outperforming larger models such as Meta’s Llama 3.2 90B Vision on certain visual understanding benchmarks, and Aya Vision 8B scoring better than some models ten times its size on some evaluations. Both models are available through the AI dev platform Hugging Face under a Creative Commons Attribution-NonCommercial 4.0 license with Cohere’s acceptable use addendum, which means they cannot be used for commercial applications.
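For researchers who want to try the open weights, the sketch below shows one common way to load a vision-language model from Hugging Face with the transformers library and ask it a question about an image in one of the supported languages. The model ID, class names, and prompt format here are assumptions based on typical vision-language model usage rather than Cohere’s documented API; the Aya Vision model card on Hugging Face is the authoritative reference for the exact identifiers and recommended usage.

```python
# Minimal sketch, assuming a standard transformers vision-language setup.
# The model ID and classes below are assumptions; check the Aya Vision
# model card on Hugging Face for the real identifiers and usage notes.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText  # requires a recent transformers release

MODEL_ID = "CohereForAI/aya-vision-8b"  # assumed repository ID; verify on Hugging Face

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Download an example photo and ask about it in Spanish (one of the 23 languages).
image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "¿Qué se ve en esta foto?"},  # "What is in this photo?"
    ],
}]

# Build the chat-formatted prompt, then combine it with the image tensors.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Generate an answer and decode only the newly generated tokens.
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```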
Cohere trained Aya Vision on synthetic annotations, that is, annotations generated by AI rather than by human labelers, a practice that has become increasingly common across the industry. Training on synthetic annotations let Cohere reach competitive performance with fewer resources, in keeping with its stated focus on efficiency and on supporting the research community.

Alongside Aya Vision, Cohere released a benchmark suite called AyaVisionBench for evaluating a model’s skills on vision-language tasks. The suite is intended to help address the “evaluation crisis” in the AI industry by providing a challenging framework for assessing a model’s cross-lingual and multimodal understanding.
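Researchers curious about the benchmark could inspect it with the Hugging Face datasets library along the lines of the sketch below; the repository ID, split name, and fields are assumptions, so the AyaVisionBench dataset card should be consulted for the real schema.

```python
# Minimal sketch: browse AyaVisionBench with the Hugging Face datasets library.
# The repository ID and split below are assumptions; consult the dataset card
# on Hugging Face for the actual identifiers and field names.
from datasets import load_dataset

bench = load_dataset("CohereForAI/AyaVisionBench", split="test")  # assumed ID and split
print(bench[0])  # inspect one example to see the image, prompt, and language fields
```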
Overall, Cohere’s efforts with Aya Vision and AyaVisionBench are a positive step toward better multilingual, multimodal evaluation and toward fostering innovation in the AI research community.
