"Since 3.5-sonnet, we have been monitoring AI model announcements and trying pretty much every major new release that claims some sort of improvement. To my surprise, aside from a minor bump with 3.6 and an even smaller bump with 3.7, literally none of the new models we’ve tried have made a significant difference on either our internal benchmarks or in our developers’ ability to find new bugs. This includes the new test-time-compute OpenAI models.
At first, I was nervous to report this publicly because I thought it might reflect badly on us as a team. Our scanner has improved a lot since August, but because of regular engineering, not model improvements. It could have been a problem with the architecture we had designed: perhaps we weren’t getting more mileage as the SWE-Bench scores went up.
But in recent months I’ve spoken to other YC founders doing AI application startups, and most of them have had the same anecdotal experience: 1. o99-pro-ultra announced, 2. Benchmarks look good, 3. Evaluated performance mediocre. This is despite the fact that we work in different industries, on different problem sets. Sometimes the founder will apply a cope to the narrative (“We just don’t have any PhD-level questions to ask”), but the narrative is there.
I have read the studies. I have seen the numbers. Maybe LLMs are becoming more fun to talk to, maybe they’re performing better on controlled exams. But I would nevertheless like to submit, based on internal benchmarks and my own and colleagues’ perceptions using these models, that whatever gains these companies are reporting to the public are not reflective of economic usefulness or generality."
#AI #GenerativeAI #LLMs #Chatbots #CyberSecurity #SoftwareDevelopment #Programming
This tracks with my own observations.