Scale AI Scrutinizes AI Benchmarks
Plus: Ukraine debuts world’s first AI diplomat, Sam Altman’s stance on the future of AI.
Hello Engineering Leaders and AI Enthusiasts!
Welcome to the 267th edition of The AI Edge newsletter. This edition features Scale AI's scrutiny of LLM benchmarks.
And a huge shoutout to our amazing readers. We appreciate you😊
In today’s edition:
📊 How much do LLMs overfit public benchmarks?
🤖 Ukraine debuts world’s first AI diplomat
🔮 Sam Altman’s stance on the future of AI
📚 Knowledge Nugget: How to streamline your writing process with Whisper and GPT-4 by Charlie
Let’s go!
How much do LLMs overfit public benchmarks?
A new study by Scale AI raises concerns about the reliability of LLM benchmark tests. It uncovers overfitting by evaluating LLMs on GSM1k, a new dataset built from scratch to mirror the popular GSM8k benchmark.
Key findings:
Overfitting: Many LLMs performed significantly worse on GSM1k than on GSM8k, with some models dropping by as much as 13%. This suggests they have partly memorized benchmark problems rather than learned genuine reasoning skills.
Family Trends: Certain LLM families, particularly Mistral and Phi, showed consistent overfitting across different model sizes.
Frontier Models Shine: Newer, more advanced LLMs showed minimal signs of overfitting, suggesting they may be achieving genuine reasoning abilities.
Data Contamination Suspected: Analysis suggests data contamination from benchmark sets may be one factor contributing to overfitting.
Reasoning Still Present: Even overfitting models exhibited some capability to solve novel problems, although not at the level their benchmark scores suggested.
Overall, the study highlights the need for more robust and reliable methods for evaluating LLM reasoning abilities.
Why does it matter?
The study shows that overfitting can create seriously misleading impressions of model performance. As AI capabilities continue to advance, evaluation approaches must keep pace and provide a more accurate picture of a model's real-world potential.
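For a rough sense of what such an overfitting check involves, here is a minimal sketch that compares a model's accuracy on the two benchmarks. The `ask_model` helper and the toy data are hypothetical stand-ins, not Scale AI's actual evaluation harness.

```python
# Minimal sketch of a GSM8k-vs-GSM1k overfitting check.
# `ask_model` is a hypothetical placeholder for whatever LLM call you use;
# each dataset is assumed to be a list of {"question", "answer"} dicts.

def ask_model(question: str) -> str:
    """Placeholder model call; wire up a real API or local inference here."""
    return ""

def accuracy(dataset: list[dict]) -> float:
    """Fraction of problems whose answer matches the reference exactly."""
    correct = sum(
        ask_model(item["question"]).strip() == item["answer"].strip()
        for item in dataset
    )
    return correct / len(dataset)

def overfit_gap(gsm8k: list[dict], gsm1k: list[dict]) -> float:
    """A positive gap means the model does worse on the unseen GSM1k-style set."""
    return accuracy(gsm8k) - accuracy(gsm1k)

if __name__ == "__main__":
    # Toy items purely for illustration; real GSM8k/GSM1k problems are word problems.
    gsm8k_sample = [{"question": "2 + 2 = ?", "answer": "4"}]
    gsm1k_sample = [{"question": "3 + 5 = ?", "answer": "8"}]
    print(f"Accuracy gap: {overfit_gap(gsm8k_sample, gsm1k_sample):.1%}")
```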
Ukraine debuts the world's first AI diplomat
Ukraine has deployed the world's first AI-generated digital spokesperson named Victoria Shi to deliver official statements on behalf of the country's Ministry of Foreign Affairs.
While the visual avatar is AI-generated, the statements will be written and verified by human diplomats. This move aims to save Ukrainian diplomats time and resources.
The main points about the AI diplomat are:
Victoria Shi's voice and tone are modeled after Rosalie Nombre, a Ukrainian singer and TV celebrity who participated free of charge.
Each statement read by Shi will include a unique QR code linking to the official text on the Ministry's website to combat deepfake issues.
Shi was created by a team called The Game Changers, who previously made content related to the war in Ukraine.
Why does this matter?
Ukraine's AI diplomat marks the beginning of a trend that could have far-reaching consequences for international relations. As other countries potentially follow suit, navigating ethical considerations and ensuring transparency and accountability in the use of AI in diplomacy will be crucial.
Sam Altman’s stance on the future of AI
During a recent appearance at Stanford University, Altman talked about the future of AI, calling GPT-4, currently an impressive model, the "dumbest model" compared with future iterations. According to Altman, the future will be dominated by "intelligent agents": AI companions that not only follow instructions but also solve problems, brainstorm solutions, and even ask clarifying questions.
OpenAI isn't just talking about the future; it's actively building it. Its next-generation model, GPT-5, is rumored for a mid-2024 release and might add video generation alongside text and images.
But the real moonshot is OpenAI's active pursuit of AGI.
Despite the significant costs involved, Altman remains undeterred. He believes that the potential benefits, such as solving complex problems across various industries, outweigh the financial burden.
Watch the whole Q&A session here.
Why does this matter?
Altman’s bold claim that GPT-4 is the "dumbest model" suggests OpenAI is aiming for something even grander, with GPT-5 as a stepping stone toward that next generation of AI.
Enjoying the daily updates?
Refer your pals to our daily newsletter and get exclusive access to 400+ game-changing AI tools.
When you use the referral link above or the “Share” button on any post, you'll get the credit for any new subscribers. All you need to do is send the link via text or email or share it on social media with friends.
Knowledge Nugget: How to streamline your writing process with Whisper and GPT-4
In a recent newsletter post, Charlie explains how he uses AI to streamline his writing process and go from loose ideas to a first draft in minutes. While he doesn't have ChatGPT write his articles, he uses it as part of his workflow:
Records a voice memo on his phone and transfers it to his laptop
Transcribes the audio using the Whisper API
Edits the transcription using the ChatGPT API, dealing with context window limitations
The author provides code snippets for transcribing audio with Whisper and editing text with GPT-4. He notes some limitations of Whisper, like not identifying slang/proper nouns well and outputting a single text block. To address the context window issue, he uses the Tiktoken library to split text into chunks that fit within the token limit.
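The post's exact snippets aren't reproduced here, but a minimal sketch of the same pipeline, assuming the official openai and tiktoken Python packages, an OPENAI_API_KEY in the environment, and a local memo.m4a file, might look like this:

```python
import tiktoken
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Transcribe the voice memo with Whisper.
with open("memo.m4a", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    ).text

# 2. Split the transcript into chunks that fit within GPT-4's context window.
encoding = tiktoken.encoding_for_model("gpt-4")
tokens = encoding.encode(transcript)
chunk_size = 3000  # leave headroom for the prompt and the edited output
chunks = [
    encoding.decode(tokens[i : i + chunk_size])
    for i in range(0, len(tokens), chunk_size)
]

# 3. Have GPT-4 clean up each chunk without changing its meaning.
edited_chunks = []
for chunk in chunks:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "Edit this voice-memo transcript into clean prose. "
                           "Fix punctuation and paragraphing; keep the meaning.",
            },
            {"role": "user", "content": chunk},
        ],
    )
    edited_chunks.append(response.choices[0].message.content)

print("\n\n".join(edited_chunks))
```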
Why does this matter?
AI dictation, like Charlie's, empowers everyone to create content, making it accessible to those who once struggled with writing.
What Else Is Happening❗
🤖 OpenAI prepares to challenge Google with ChatGPT-powered search: OpenAI is building a search engine, search.chatgpt.com, potentially powered by Microsoft Bing. This leverages their existing web crawler and Bing's custom GPT-4 for search, posing a serious threat to Google's dominance. (Link)
🚫 Microsoft bans U.S. police use of Azure OpenAI for facial recognition
Microsoft has banned U.S. police departments from using the Azure OpenAI Service for facial recognition, including integrations with OpenAI's image-analyzing models. The move follows Axon's controversial GPT-4-powered tool for summarizing audio from body cameras. However, the ban has exceptions and doesn't cover Microsoft's other AI contracts with law enforcement. (Link)
🌐 IBM expands AI and data software on AWS marketplace
IBM has significantly expanded its software offerings on the AWS Marketplace, making 44 products accessible to customers in 92 countries, up from just five. The move, part of a strategic collaboration with AWS, focuses on AI and data technologies like watsonx.data, watsonx.ai, and the upcoming watsonx.governance. (Link)
🔒 Google Cloud supports Azure and AWS; integrates AI for security
Google Cloud's security tools now support Azure and AWS, enabling enterprises to manage security across multi-cloud environments. AI integration with existing solutions streamlines the user experience and helps address the security talent gap, making it easier to manage risk amid rising cyber threats while simplifying security tasks for enterprises. (Link)
💸 Microsoft invests $2.2B in Malaysia's cloud and AI transformation
Microsoft is investing $2.2 billion over the next four years to support Malaysia's digital transformation, its largest investment in the country's 32-year history. The investment includes building cloud and AI infrastructure, creating AI skilling opportunities for 200,000 people, establishing a national AI Centre of Excellence, enhancing cybersecurity capabilities, and supporting the growth of Malaysia's developer community. (Link)
New to the newsletter?
The AI Edge keeps engineering leaders & AI enthusiasts like you on the cutting edge of AI. From machine learning to ChatGPT to generative AI and large language models, we break down the latest AI developments and how you can apply them in your work.
Thanks for reading, and see you tomorrow. 😊