The Impending Release of OpenAI's GPT-4: A Multimodal System That Should Worry Google
“Microsoft Germany CTO Andreas Braun confirmed that a multimodal GPT-4 will be released within a week of March 9, 2023, with capabilities that include processing video, image, and sound inputs.”
Exploring the Multimodal Capabilities of GPT-4
A breakthrough in large language models: unlike its predecessors GPT-3 and GPT-3.5, which handled only text, GPT-4 is set to operate across multiple modalities, including text, images, sound, and video. This advancement marks a significant milestone in language processing and is a long-awaited development that SEJ predicted back in January 2023.
According to a statement from Dr. Andreas Braun, the CTO of Microsoft Germany:
“We will introduce GPT-4 next week, there we will have multimodal models that will offer completely different possibilities – for example videos…”
The reporting on GPT-4 lacked specific details, so it was unclear whether the statements about multimodality referred to GPT-4 in particular or to multimodal models in general. During the discussion, Microsoft’s Director of Business Strategy, Holger Kenn, elaborated on the concept of multimodality, but again the reporting left it ambiguous whether he meant GPT-4’s capabilities specifically or multimodality in general.
In my opinion, Kenn’s comments were likely directed towards GPT-4’s specific multimodal capabilities.
According to the news report, Kenn explained multimodal AI as the ability to translate text not only into images but also into music and video.
Another notable development is that Microsoft is working to incorporate “confidence metrics” into its AI models, making them more reliable by grounding them in factual information.
Furthermore, it appears that Microsoft released a new multimodal language model named Kosmos-1 at the start of March 2023, news that received surprisingly little attention in the United States. As the German news site Heise.de reported:
“…The team conducted several tests on the pre-trained Kosmos-1 model, including image classification, automated image labeling, optical text recognition, and speech generation. The model performed well, particularly on tasks involving visual reasoning, i.e., drawing conclusions from images without relying on language.

Kosmos-1 is a multimodal model that combines text and images as input modalities.

While Kosmos-1 represents a significant development, GPT-4 goes one step further by incorporating a third modality, video, and potentially sound as well.”
Conclusion
GPT-4 is reported to work across all languages, receiving a question in one language and answering in another. This is possible because the model can extract knowledge from multiple languages and then answer in the language in which the question was asked.
Similar to Google’s multimodal AI, MUM, GPT-4 could thus answer a question in one language even when the relevant data exists only in another.
At present, there are no official announcements about where GPT-4 will be deployed, though Azure-OpenAI is known to be one likely venue.
Unlike Google, which folds AI quietly into a range of consumer-facing products, Microsoft is implementing AI in a prominent, highly visible way. That visibility is drawing attention and reinforcing the perception that Google is falling behind in the race for consumer-facing AI.
The German-language reporting is available at Heise.de.