Cohere Launches Aya Vision

Cohere, an artificial intelligence startup, has released a multimodal AI model called Aya Vision, which the company describes as best-in-class. The model can write image captions, answer questions about photos, translate text, and generate summaries in 23 major languages. Cohere has made Aya Vision available for free via WhatsApp, framing the release as a step toward making technical breakthroughs accessible to researchers worldwide.

Cohere highlighted that, despite significant advancements in AI, there remains a substantial gap in model performance across different languages. This gap is particularly evident in multimodal tasks that involve both text and images, and Aya Vision is designed to help close it. The model comes in two versions: Aya Vision 32B and Aya Vision 8B. The 32B version is the more capable of the two and, according to Cohere, outperforms models more than twice its size, such as Meta's Llama-3.2 90B Vision, on certain visual-understanding benchmarks. Aya Vision 8B, meanwhile, has outscored models ten times its size in some evaluations.

Both models can be downloaded from the AI development platform Hugging Face under a Creative Commons Attribution Non-Commercial 4.0 license, together with Cohere's acceptable-use addendum, meaning they cannot be used for commercial applications. Aya Vision was trained on a diverse pool of English datasets, which were translated and used to create synthetic annotations. Annotations are the labels that help models understand and interpret data during training: for an image-recognition model, they might include markings around objects or captions identifying each person, place, or object depicted in an image.
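To make the idea of an annotation concrete, here is a minimal, hypothetical sketch of what a single annotated training record might look like. The field names and structure are purely illustrative and do not reflect Cohere's actual data format:

```python
# A hypothetical annotated training record for a vision-language model.
# Field names and layout are illustrative only, not Cohere's real schema.
annotation = {
    "image_id": "img_00042",
    "objects": [
        # Bounding boxes mark where each object appears: [x, y, width, height]
        {"label": "dog", "bbox": [34, 50, 120, 90]},
        {"label": "bicycle", "bbox": [200, 40, 160, 110]},
    ],
    "captions": {
        # One caption per language, keyed by ISO 639-1 code
        "en": "A dog standing next to a bicycle.",
        "fr": "Un chien debout à côté d'un vélo.",
    },
}

def caption_for(record, lang, fallback="en"):
    """Return the caption in the requested language, falling back to English."""
    return record["captions"].get(lang, record["captions"][fallback])
```

A training loop would pair the image with one of these captions (or with the object boxes) so the model learns to connect pixels to words in each language.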

Cohere took the notable step of using synthetic annotations generated by AI to train Aya Vision. This aligns with a broader industry trend: companies such as OpenAI are increasingly leveraging synthetic data as the supply of real-world training data dwindles. According to Gartner, 60% of the data used for AI and analytics projects last year was synthetically created. Cohere says synthetic annotations let it achieve competitive performance with fewer resources, an efficiency gain it frames as support for a research community that often has limited access to compute.
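The translate-then-annotate recipe described above can be sketched roughly as follows. The `translate` function is a stand-in stub for a real machine-translation model, and the whole pipeline is an illustrative simplification, not Cohere's actual implementation:

```python
# Illustrative sketch of a translate-and-synthesize annotation pipeline.
# The `translate` stub stands in for a real machine-translation model.

TARGET_LANGS = ["fr", "es", "hi"]  # a small subset of the 23 languages

def translate(text: str, lang: str) -> str:
    # Stub: a real pipeline would call an MT model here.
    return f"[{lang}] {text}"

def synthesize_annotations(english_records):
    """Expand English-only caption records into multilingual training examples."""
    out = []
    for rec in english_records:
        for lang in TARGET_LANGS:
            out.append({
                "image_id": rec["image_id"],
                "lang": lang,
                "caption": translate(rec["caption"], lang),
            })
    return out

examples = synthesize_annotations([{"image_id": "img_1", "caption": "A red bus."}])
```

The appeal of this design is leverage: one curated English dataset fans out into training examples for every target language, at the cost of whatever errors the translation step introduces.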

Alongside Aya Vision, Cohere also launched a new benchmark suite called AyaVisionBench, designed to probe a model's skills on vision-language tasks such as spotting differences between two images and converting screenshots to code. The AI industry is facing what some have termed an evaluation crisis, driven by the popularity of benchmarks whose aggregate scores correlate poorly with proficiency on the tasks most AI users actually care about. Cohere asserts that AyaVisionBench is a step toward fixing this, offering a broad and challenging framework for assessing a model's cross-lingual and multimodal understanding.
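The evaluation-crisis point is easy to see with a toy example. Below is a hypothetical set of per-task scores (the task names and numbers are invented, not real AyaVisionBench results) showing how a single averaged number can hide a serious weakness on one task:

```python
# Hypothetical per-task benchmark scores. A single averaged number can
# mask a large weakness on an individual task, which is the core of the
# "evaluation crisis" critique of aggregate leaderboard scores.
scores = {
    "image_difference": 0.81,
    "screenshot_to_code": 0.42,
    "multilingual_captioning": 0.77,
}

def aggregate(task_scores):
    """Plain mean across tasks: the kind of single number that can mislead."""
    return sum(task_scores.values()) / len(task_scores)

def weakest_task(task_scores):
    """Report the lowest-scoring task, which the aggregate obscures."""
    return min(task_scores, key=task_scores.get)
```

Here the mean is about 0.67, a respectable-looking figure that says nothing about the model scoring 0.42 on screenshot-to-code; reporting per-task breakdowns, as task-oriented suites aim to do, keeps that weakness visible.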