OpenAI Has Entered the Video AI World!
Plus: Google announces Gemini 1.5 with 1M tokens, Meta’s V-JEPA.
Hello Engineering Leaders and AI Enthusiasts!
Welcome to the 212th edition of The AI Edge newsletter. This edition brings OpenAI’s text-to-video model, Sora.
And a huge shoutout to our amazing readers. We appreciate you😊
In today’s edition:
🚀 OpenAI launches Sora, a text-to-video model
🌟 Google announces Gemini 1.5 with 1 million tokens!
🤖 Meta’s V-JEPA: A step toward advanced machine intelligence
📚 Knowledge Nugget: How do I evaluate LLM coding agents?
Let’s go!
OpenAI launches Sora, a text-to-video model
Out of nowhere, OpenAI drops a video generation model. Sora can create 1-minute videos from text or a still image while maintaining visual quality and adherence to the user’s prompt. It can also “extend” existing video clips, filling in the missing details.
Sora is able to generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background. It understands not only what the user has asked for in the prompt but also how those things exist in the physical world.
Sora is currently in research preview, and OpenAI is working with red teamers who are adversarially testing the model.
Why does this matter?
OpenAI has entered the video generation race alongside Runway, Pika, and others, and might completely change it (probably for the better). Its cherry-picked samples do look quite impressive compared to competitors'. But its true significance may lie in the research behind it.
Sora builds on past research in DALL·E and GPT models, giving OpenAI an edge. Sora serves as a foundation for models that can understand and simulate the real world, a capability OpenAI believes will be an important milestone for achieving AGI.
Google announces Gemini 1.5 with 1 million tokens!
After launching Gemini Advanced last week, Google has now launched Gemini 1.5. It delivers dramatically enhanced performance, with a breakthrough in long-context understanding across modalities. It can process up to 1 million tokens consistently!
Gemini 1.5 is more efficient to train and serve, thanks to a new Mixture-of-Experts (MoE) architecture. [In simple terms, an MoE model routes each input to only the most relevant expert sub-networks instead of activating the whole model, yielding faster, more focused answers to queries.]
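As a rough illustration of the routing idea, here is a toy top-1 MoE layer in PyTorch. This is a generic sketch, not Gemini's actual (unpublished) architecture; all module names and sizes are made up:

```python
# Toy Mixture-of-Experts layer: a router scores experts per token,
# and each token is processed only by its best-scoring expert(s).
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=64, num_experts=4, top_k=1):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts)  # scores each expert per token
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, dim)
        scores = self.router(x).softmax(dim=-1)         # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)  # pick best expert(s)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

x = torch.randn(8, 64)     # 8 tokens
print(TinyMoE()(x).shape)  # torch.Size([8, 64])
```

Because only a fraction of the experts run per token, compute per query stays roughly constant even as total parameter count grows, which is the efficiency win the MoE approach is after.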
Gemini 1.5 Pro comes with a standard 128,000 token context window. However, a limited group of developers and enterprise customers can try it with 1 million tokens via AI Studio and Vertex AI in private preview.
Why does this matter?
Google has achieved the longest context window of any large-scale foundation model yet. More information in a prompt means more consistent, relevant, and useful output. A million tokens open up huge possibilities for developers: upload hundreds of pages of text, entire code repos, or long videos, and let Gemini reason across them. It can probably learn a whole new skill from just a prompt!
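For developers with preview access, using the long context should look much like any other Gemini API call. Here's a minimal sketch with the google-generativeai Python SDK; the model name and file path are placeholders, so check the docs for the exact identifiers:

```python
# Minimal sketch of long-context prompting with the google-generativeai SDK.
# Assumes private-preview access to the 1M-token window; "gemini-1.5-pro"
# is illustrative, not a confirmed model identifier.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

# Stuff an entire code repository into a single prompt.
with open("repo_dump.txt") as f:  # e.g., all source files concatenated
    repo = f.read()

print(model.count_tokens(repo))   # verify you're under the context limit
response = model.generate_content(
    f"Here is a codebase:\n{repo}\n\nSummarize its architecture."
)
print(response.text)
```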
Meta’s V-JEPA: A step toward advanced machine intelligence
V-JEPA is a non-generative model that learns by predicting missing or masked parts of a video in an abstract representation space. Unlike generative approaches that try to fill in every missing pixel, V-JEPA has the flexibility to discard unpredictable information, which leads to improved training and sample efficiency by a factor between 1.5x and 6x.
V-JEPA (Video Joint Embedding Predictive Architecture) is a method for teaching machines to understand and model the physical world by watching videos. With a self-supervised approach for learning representations from video, V-JEPA can be applied to various downstream image and video tasks without adapting the model parameters.
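To make the core idea concrete, here is a heavily simplified toy version of a JEPA-style objective: mask some patches, encode the visible ones, and predict the masked patches' embeddings rather than their pixels. Everything here (module sizes, the masking scheme, the single-vector predictor) is illustrative; the real V-JEPA uses a large video transformer and a separate target encoder:

```python
# Toy sketch of the JEPA idea: predict masked video patches in a learned
# embedding space rather than in pixel space.
import torch
import torch.nn as nn

dim, n_patches = 128, 196
encoder   = nn.Linear(dim, dim)        # stand-in for a video transformer
predictor = nn.Linear(dim, dim)        # predicts masked embeddings from context

patches = torch.randn(n_patches, dim)  # patch features from a video clip
mask = torch.rand(n_patches) < 0.5     # randomly mask about half the patches

with torch.no_grad():
    targets = encoder(patches[mask])   # target embeddings: no pixel decoding

context = encoder(patches[~mask])      # encode only the visible patches
pred = predictor(context.mean(0, keepdim=True)).expand_as(targets)

# Loss lives in representation space: unpredictable pixel detail is discarded
loss = (pred - targets).abs().mean()
loss.backward()
print(loss.item())
```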
Why does this matter?
Forgoing pixel-perfect predictions, V-JEPA takes a "smart skipping" approach that prioritizes understanding over detail, much like how humans learn. This shift toward a more "grounded" understanding paves the way for AI that can reason, plan, and tackle complex tasks just like us. It opens doors to countless applications yet to be imagined.
Enjoying the daily updates?
Refer your pals to subscribe to our daily newsletter and get exclusive access to 400+ game-changing AI tools.
When you use the referral link above or the “Share” button on any post, you'll get the credit for any new subscribers. All you need to do is send the link via text or email or share it on social media with friends.
Knowledge Nugget: How do I evaluate LLM coding agents?
The capabilities of coding agents built using LLMs have evolved from auto-completing small chunks of code to generating entire repositories (repos). Here are the most common evaluation benchmarks used to measure the accuracy of the base LLMs on coding tasks (a quick sketch of the pass@k metric follows the list).
HumanEval: Hand-written code evaluation dataset by OpenAI
MBPP: Mostly Basic Python Problems Dataset from Google Research
MultiPL-E or Multilingual Human Eval: Translations of HumanEval from Python to 18 programming languages via mini-compilers
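These benchmarks typically report pass@k: the probability that at least one of k sampled completions passes the unit tests. Here is the standard unbiased estimator introduced alongside HumanEval; the sample counts in the example are made up:

```python
# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021):
# with n samples per problem and c correct, pass@k = 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled solutions passes the tests."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 200 samples per problem, 37 of which passed the unit tests
print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185, i.e., c/n
print(round(pass_at_k(n=200, c=37, k=10), 3))  # higher with more attempts
```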
While these benchmarks focus on the accuracy of basic coding tasks, they may not be representative of real-world use cases. In his article, the author discusses these benchmarks, general-purpose agents that can solve a wider range of tasks, fine-tuning, and the challenges of evaluating coding agents.
Why does this matter?
LLM-based coding agents are evolving rapidly, but evaluating their performance requires careful consideration and potentially custom approaches. Benchmarks are still emerging, and task-specific evaluation might be needed; tools like Log10 can help with custom evaluations and fine-tuning.
What Else Is Happening❗
🍏Apple plans to launch an AI-based code completion tool to rival GitHub Copilot.
Apple has been working on the tool for the past year as part of the next major version of Xcode, Apple’s flagship programming software. It has now expanded testing of the features internally and has ramped up development ahead of a plan to release it to third-party software makers as early as this year. (Link)
🛡️Google announces free AI cyber tools to strengthen online security.
Google will introduce a new open-source resource powered by AI that utilizes file type identification to help detect malware. The tool, which is already being used to protect products including Gmail and Google Drive, will be made available for free. (Link)
🔍OpenAI is reportedly developing AI web search to directly compete with Google.
The service is said to be partly powered by Microsoft's Bing search. However, it is unclear if it will be a standalone search product separate from ChatGPT, which already has Bing integrated. The product could also be linked to an AI agent that independently performs tasks on the web. (Link)
💰Microsoft pledges $3.44bn for Germany's AI industry amid economic challenges.
Microsoft will invest 3.2bn euros in Germany in the next 2 years. With its biggest investment in Germany in 40 years, Microsoft aims to double the capacity of its AI and data center infrastructure in the country and expand its training programs. Germany is Europe's largest economy, facing its worst slump in 20 years. (Link)
🎓Penn Engineering launches first Ivy League undergraduate major degree in AI.
The University of Pennsylvania School of Engineering and Applied Science announced the launch of a Bachelor of Science in Engineering in AI degree. It is also one of the very first AI undergraduate engineering programs in the U.S. It will produce engineers who can leverage this powerful technology in a way that benefits all humankind. (Link)
New to the newsletter?
The AI Edge keeps engineering leaders & AI enthusiasts like you on the cutting edge of AI. From machine learning to ChatGPT to generative AI and large language models, we break down the latest AI developments and how you can apply them in your work.
Thanks for reading, and see you tomorrow. 😊