Apple is something of a latecomer to the field of large language models (LLMs), lagging behind Google, Microsoft, and Meta in developing powerful AI tools, but it appears to be catching up fast.
Earlier this year, CEO Tim Cook told investors that he had an important announcement to make regarding AI, one that would be a "major breakthrough." Many suspect this will be a new LLM-powered version of Siri, similar to how Google replaced its Assistant with Gemini.
Apple researchers have just revealed details of what could be the basis for this next-generation Siri, and if the rumors are true, it could work alongside Gemini on the iPhone, giving users a choice of assistant.
MM1, presented as a preprint research paper, essentially offers a new way to speed up the training of new models (possibly including a Siri 2.0) using AI-generated data and labels.
The core of MM1 is a new method for training multimodal models using synthetic data, including images and text.
The researchers behind MM1 claim that their new method improves performance and reduces the number of follow-up prompts needed to obtain the desired results.
Being able to improve prompt comprehension and reach the desired output with as little interaction with the AI as possible is ideal for consumer technology, especially Siri, which is used by a wide audience with varying degrees of technical proficiency.
MM1 appears to be a family of AI models, the largest having about 30 billion parameters. While this is considerably smaller than the trillion-plus parameters of GPT-4 and Claude 3 Opus, the researchers claim its increased efficiency makes it competitive on major benchmarks.
"By scaling up the recipe, they built MM1, a multimodal model family with up to 30 billion parameters that achieves state-of-the-art pre-training metrics and performs competitively on multimodal benchmarks after fine tuning," they write
A key breakthrough is the ability to analyze and understand visual input, especially images and other forms of visual content. I recently tested how well ChatGPT, Claude, and Gemini perform on this task. The paper, titled "MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training," was quietly released with minimal fanfare and is openly available, along with full details of the training data and benchmarks.
In it, the researchers argue that state-of-the-art performance can be achieved by combining different types of training data and model architectures, rather than relying on a single approach.
The researchers write that achieving that performance requires a combination of image-caption, interleaved image-text, and text-only data, "diverse datasets spanning visual and linguistic information."
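To make that concrete, here is a minimal sketch of what weighted data mixing can look like during training. The dataset names and mixing ratios below are hypothetical illustrations, not values from the paper:

```python
import random

# Hypothetical mixing ratios -- the paper studies mixes like this,
# but these exact names and weights are illustrative, not Apple's.
MIXTURE = {
    "image_caption": 0.45,  # (image, caption) pairs
    "interleaved":   0.45,  # documents interleaving images and text
    "text_only":     0.10,  # plain text, to preserve language ability
}

def sample_batch(datasets, batch_size=32):
    """Build a training batch by picking each example's source
    dataset according to the mixture weights above."""
    sources = random.choices(
        population=list(MIXTURE.keys()),
        weights=list(MIXTURE.values()),
        k=batch_size,
    )
    return [next(datasets[name]) for name in sources]
```

Here `datasets` would map each name to an iterator over that corpus; the sampler simply draws proportionally more examples from the heavily weighted sources.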
Those capabilities include image captioning, visual question answering, and natural language understanding (e.g., getting the desired output from a one-shot or few-shot prompt). "Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting," the team explained.
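For readers unfamiliar with the jargon, few-shot chain-of-thought prompting just means showing the model a few worked examples, reasoning included, before the real question. A hypothetical, text-only illustration:

```python
# A hypothetical few-shot chain-of-thought prompt: two worked examples
# (the "shots"), each with its reasoning spelled out, then the real query.
prompt = """Q: A store sells 3 apples for $2. How much do 9 apples cost?
A: 9 apples is 3 groups of 3 apples. 3 x $2 = $6. The answer is $6.

Q: A train travels 60 miles per hour. How far does it go in 2.5 hours?
A: Distance = speed x time. 60 x 2.5 = 150. The answer is 150 miles.

Q: A recipe uses 2 eggs for 4 servings. How many eggs for 10 servings?
A:"""
```

The model is expected to imitate the pattern and reason its way to "5 eggs"; in MM1's multimodal case, the worked examples can include images as well as text.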
MM1 uses a different architecture from other models, including higher-resolution image encoders, takes a different approach to pre-training and labeling, and leans on data mixing to improve overall performance from a single prompt.
It also uses a mixture-of-experts (MoE) architecture to scale up while keeping processing requirements down, suggesting the possibility of running it on devices such as iPhones and laptops rather than in the cloud.
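The paper doesn't ship code, but the core MoE idea is simple enough to sketch. Below is a generic top-k routing layer in Python/NumPy; it illustrates the technique, not Apple's implementation, and every name and shape in it is an assumption:

```python
import numpy as np

def moe_layer(x, experts, gate_w, k=2):
    """Route one token vector x to its top-k experts.

    experts: list of callables, each a small feed-forward network
    gate_w:  (d_model, n_experts) router weight matrix
    Only k experts actually run per token, which is how MoE models grow
    their total parameter count without growing per-token compute.
    """
    scores = x @ gate_w                                # one router logit per expert
    top = np.argsort(scores)[-k:]                      # indices of the k best experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                           # softmax over chosen experts only
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy demo: 4 experts, 8-dimensional tokens, 2 experts firing per token.
rng = np.random.default_rng(0)
d, n = 8, 4
experts = [lambda v, W=rng.normal(size=(d, d)): np.tanh(v @ W) for _ in range(n)]
output = moe_layer(rng.normal(size=d), experts, rng.normal(size=(d, n)), k=2)
```

Because the inactive experts never execute, a model can hold far more parameters than it uses on any single token, which is what makes on-device deployment of large models plausible.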
Google recently leveraged the MoE architecture in its Gemini 1.5 Pro model, which has a context window of over 1 million tokens. This allowed it to improve efficiency even while handling much larger inputs.
While the paper does not discuss Siri or potential products, its focus on performance and efficiency, on getting reliable results with minimal prompting, and on extensive multimodal functionality suggests the direction Apple may be heading with Siri in the future.
Many of an LLM-powered Siri's features will likely have to run "on-device" due to Apple's long-standing privacy stance, particularly with respect to processing personal information.
Developing models that are very powerful, able to learn from user interaction, and small enough to run on an iPhone would be a big move.
With recent news that Apple may bring Gemini to the iPhone, and earlier reports that it is also in talks with OpenAI, the maker of ChatGPT, Apple appears to be taking a multi-faceted approach to achieving the "major breakthrough" in AI that Cook promised investors.