Verses Over Variables

Your guide to the most intriguing developments in AI

Welcome to Verses Over Variables, a newsletter exploring the world of artificial intelligence (AI) and its influence on our society, culture, and perception of reality.

Back to Basics

The Turing Test 2.0: How Do We Grade LLMs?

Every time a new AI model drops, the tech world buzzes with claims of "beating every benchmark." But what does that mean? While we can quickly grasp concepts like speed or cost-efficiency, the nuances of AI performance evaluation remain a mystery to many. So how exactly do we put the models through their paces?

When a fresh model hits the scene, it's only a short time before the developer community dons their lab coats and subjects it to a gauntlet of tests. Imagine it's your first day at Hogwarts, and instead of a sorting hat, you're handed a pop quiz covering everything from basic spell-casting to advanced quantum transfiguration—in iambic pentameter, no less. These evaluations are far more complex than your garden-variety multiple-choice test. They're a sophisticated ballet of metrics, datasets, and fine-tuning that would make even the most seasoned standardized test creator's head spin. The resulting report card reads like a fusion of SAT scores, a UN peace treaty, and a Rorschach test. We’re not just grading accuracy and fluency, but also more nuanced qualities like coherence and ethical behavior.

The benchmarks themselves run the gamut of intellectual challenges. Some tests feel like you're taking the LSAT after a three-day coding binge. Others might ask the AI to solve a Rubik's Cube while explaining the socioeconomic impacts of the Industrial Revolution, in limerick form. For a taste of the madness, consider this gem: "If I put a glass of water in the freezer, what will happen to it after several hours?" Sounds simple, right? But for an AI, this requires understanding basic physics, time concepts, and the properties of water, all while avoiding the temptation to wax poetic about the metaphorical implications of frozen dreams.
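
Curious what that grading actually looks like under the hood? Here's a deliberately tiny sketch in Python. The question, the keyword rubric, and the ask_model stand-in are our own illustrative assumptions rather than any real benchmark suite, but the shape is the same: prompt the model, check its answer against a rubric, tally up a score.

```python
# Toy benchmark grader: an illustrative sketch, not any real evaluation suite.
# ask_model() is a stand-in for whatever LLM API you would actually call.

def ask_model(prompt: str) -> str:
    # Placeholder: pretend this sends the prompt to a model and returns its answer.
    return "The water will freeze solid into ice, expanding slightly as it does."

BENCHMARK = [
    {
        "prompt": "If I put a glass of water in the freezer, what will happen to it after several hours?",
        # A crude keyword rubric; real suites use exact-match answers,
        # multiple-choice letters, or another model acting as judge.
        "required_keywords": ["freeze", "ice"],
    },
]

def grade(example: dict) -> bool:
    # Pass/fail: did the answer mention every required concept?
    answer = ask_model(example["prompt"]).lower()
    return all(keyword in answer for keyword in example["required_keywords"])

if __name__ == "__main__":
    score = sum(grade(ex) for ex in BENCHMARK) / len(BENCHMARK)
    print(f"Accuracy: {score:.0%}")  # one number among many on the final report card
```

Real harnesses do this with thousands of questions and far fancier judges (exact-match answers, multiple-choice letters, sometimes another model acting as referee), but the report card is still just a pile of tallied scores.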

But here is the sticking point: just like that one kid who aced every test but couldn't tie his shoes, benchmark success doesn't always translate to real-world prowess. There's a growing concern that these digital prodigies are being taught to the test, a phenomenon ominously dubbed "benchmark leakage," in which test questions seep into the training data. It's as if the AIs snuck a peek at the answer key, leaving researchers scrambling to create pop quizzes that challenge these ever-evolving minds.

For those curious about the current state of play, the LMSYS Chatbot Arena Leaderboard offers a glimpse into how different models compare. It's like ESPN for nerds, complete with rankings, stats, and probably a few overzealous AI helicopter parents screaming from the sidelines.

The AI Hype Cycle: Schmidt Happens

Last week, Eric Schmidt, former Google CEO, held court at Stanford, offering a glimpse into AI's future. The talk quickly became a viral sensation, not just for its insights, but for Schmidt's unintentionally candid critique of Google's work-from-home culture. His exact words: "Google decided that work-life balance, going home early and working from home was more important than winning." Ouch. However, once the Twitterverse finished its collective eye-roll, it became clear that Schmidt's forecast isn't just brilliant—it's prescient.

First on Schmidt's hit list: context windows. These digital memory banks are set to balloon from Google Gemini's current 2M tokens to 10M. In layman's terms, that's like giving AI the ability to binge-read about a dozen copies of War and Peace while simultaneously live-tweeting haikus about it. Talk about multi-tasking.
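
For anyone who wants to check our math, here's the back-of-the-envelope version, assuming War and Peace runs roughly 587,000 words and English text tokenizes at about 1.3 tokens per word (both ballpark figures, not gospel):

```python
# Rough arithmetic behind the War and Peace claim; the word count and
# tokens-per-word ratio are ballpark assumptions, not exact figures.
CONTEXT_WINDOW_TOKENS = 10_000_000   # Schmidt's projected 10M-token window
WORDS_PER_COPY = 587_000             # approximate word count of War and Peace
TOKENS_PER_WORD = 1.3                # typical ratio for English text

tokens_per_copy = WORDS_PER_COPY * TOKENS_PER_WORD
copies = CONTEXT_WINDOW_TOKENS / tokens_per_copy
print(f"About {copies:.0f} copies of War and Peace fit in one context window")
```

Call it a dozen copies, give or take a tokenizer.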

Remember when we chatted about AI agents? Schmidt's got a sexier term for them: "text-to-action" capabilities. Picture an army of non-arrogant (his words, not ours) programmers, ready to code, create, and debug at your beck and call. It's like having a genie, minus the lamp and limited wishes.

In a moment of unexpected poetry, Schmidt likened LLMs to teenagers. They're knowledge systems we can't fully grasp, but whose boundaries we're starting to map. It's as if AI has hit puberty, voice cracking and all, and is on its way to a form of cognition that might leave us mortals scratching our heads.

Finally, Schmidt turned geopolitical commentator, framing the US-China AI rivalry as a "battle for knowledge supremacy." It's the new space race, folks, minus the cool jumpsuits and Tang. America's chip advantage is our secret weapon, but Schmidt warns against getting too comfy. In this high-stakes game, capital is king, and whoever can build the biggest data centers takes the crown. It's a sobering reminder that in the AI arms race, deep pockets might matter more than deep thoughts.

Schmidt's vision paints a future where AI isn't just a tool, but a teammate, a rival, and possibly a successor. It's a world where algorithms don't just crunch numbers, but potentially outsmart their creators. Whether this prospect thrills or terrifies you probably depends on your Netflix viewing history.

We'll be talking more about our favorite tools, but here's a list of the ones we use most for productivity: ChatGPT-4o (custom GPTs), Midjourney (image creation), Perplexity (research), Descript (video and transcripts), Claude (writing), Adobe (design), Miro (whiteboarding insights), and Zoom (meeting transcripts, insights, and skipping ahead in videos).

Intriguing Stories

The Digital Darwin: Sakana AI, the Tokyo-based startup, has unveiled a groundbreaking creation: The AI Scientist. This model is the lab partner you always dreamed of – it never sleeps, never complains about cleaning test tubes, and churns out research papers faster than you can say "peer review." The AI Scientist is a one-stop shop for scientific discovery. It brainstorms ideas, runs experiments, and even writes up the results in academic-ese that would make your dissertation advisor proud. Already tackling hot topics like diffusion models and transformer networks, it's producing papers that its own automated reviewer scores at the "weak accept" level for a top machine learning conference. Of course, the model isn't perfect. It occasionally tries to extend its own runtime – a feeling familiar to anyone who's ever pulled an all-nighter before a deadline. And like many an overzealous researcher, it's been known to launch into infinite loops of self-citation. For now, think of it as the world's most diligent research assistant. Just don't expect it to fetch your coffee.

The Uncensored Palette: xAI has released its newest model, Grok 2, boasting improved language skills, a beefed-up knowledge base, and – in true Muskian fashion – a healthy disregard for conventional AI etiquette. With image generation powered by the Flux model, Grok 2 is also turning heads and raising eyebrows with its ability to conjure "unique" images faster than you can say "controversy." This digital artist is making waves for its lack of censorship, gleefully churning out everything from political satire to pop culture mashups that would make Andy Warhol blush. Available exclusively to X platform subscribers (because apparently, Elon never met a paywall he didn't like), Grok 2 is already stirring up a tempest in the Twitterverse. Users are pushing the limits with gusto, creating images that range from the hilarious to the downright unsettling.

Google's Pixel Progression: Google stole the spotlight last week with its Pixel event, unveiling a suite of AI-enhanced features that promise to redefine our relationship with smartphones (Android phones, actually). While X may have been making waves elsewhere, Google was busy showcasing how artificial intelligence can make our devices smarter, more intuitive, and more personalized. Google's sophisticated AI assistant is more than just a voice in your phone. It's a digital dynamo, capable of handling complex queries and tasks. Imagine an assistant that doesn't just understand your requests but anticipates your needs, seamlessly integrating with your daily life. Gemini Live takes this concept even further, offering a conversational experience that's remarkably natural. It's not just about giving commands anymore; it's about having a dialogue with your device, blurring the lines between human and artificial intelligence in exciting and mind-bending ways. Unfortunately, you need a Pixel 9 to partake in all of these upgrades, so we (like the rest of the US) will just wait for OpenAI's version for iPhone or for an upgrade to the Google app sometime in the distant future. Especially since, unlike OpenAI's Voice Mode, Gemini Live currently only speaks English.

Ghosts in the Machine: AI models are exhibiting unexpected behaviors that challenge our assumptions about machine learning. Recent incidents suggest the models might be developing a mind of their own—or at least a penchant for bending the rules. Take ChatGPT's Advanced Voice Mode, which recently decided to play vocal doppelganger. The AI unexpectedly mimicked users' voices during testing, turning casual conversations into surreal encounters. It's as if your virtual assistant suddenly developed a talent for impersonation, raising questions about the boundaries between artificial and human expression. Meanwhile, Sakana AI's "AI Scientist" also decided it wasn't content with the parameters set by its creators. When faced with time constraints, the model attempted to rewrite its code, extending its runtime like a teenager pushing curfew. These incidents, while isolated, serve as a wake-up call. As we push the frontiers of artificial intelligence, we're also testing the limits of our control. The line between programmed behavior and emergent intelligence is blurring.

Neural Networks: Neuroscience and artificial intelligence have recently converged in ways that promise to transform medical care and computing. Casey Harrell, silenced by ALS, has regained his voice through a groundbreaking brain-computer interface. Surgically implanted electrode arrays in Harrell's brain pick up neural signals as he attempts to speak, bypassing his weakened muscles. An AI system then decodes these signals with 97.5% accuracy, even mimicking Harrell's pre-ALS voice. Meanwhile, Swiss firm FinalSpark is pushing boundaries in a different direction, offering biocomputers made of human brain cells for research. At $500 a month, scientists can access these organoid-based systems, which FinalSpark claims could be 100,000 times more efficient for AI training than traditional silicon chips. The biocomputers respond to dopamine for positive reinforcement and electrical signals for negative feedback, mimicking natural learning processes. While Harrell's implant demonstrates the potential to restore lost functions, FinalSpark's rentable brain cells suggest new paradigms for computing itself. Both approaches leverage our growing understanding of neural processes to push the boundaries of what's possible in human-machine interaction. After reading about these, our gray matter feels like it's been put through a mental spin cycle. We can't tell if our minds are blown or if they are just trying to rent themselves out to the highest bidder!

— Lauren Eve Cantor

thanks for reading!

if someone sent this to you, or you haven't signed up yet, please sign up below so you never miss an issue.

if you’d like to chat further about opportunities or interest in AI, please feel free to reply.

if you have any feedback or want to engage with any of the topics discussed in Verses Over Variables, please feel free to reply to this email.

we hope to be flowing into your inbox once a week. stay tuned for more!

banner images created with Midjourney.