Welcome to Tech Beat, your daily read on the technology stories that matter.
OpenAI has quietly retired SWE-bench Verified as a benchmark for measuring frontier coding capabilities, and the reasoning is telling. The benchmark, once a gold standard for evaluating AI coding agents, has essentially been gamed into irrelevance: models have improved so fast that the tests no longer distinguish between them. It raises a familiar question in AI evaluation: what happens when the yardstick keeps breaking?
That tension between AI capability and real-world reliability surfaces in a different way on Hacker News today, where a shared ChatGPT conversation is making the rounds under the blunt title "GPT cannot even count beans correctly." It's a small moment, but it keeps landing because it captures something genuine: even as benchmarks fall, basic reasoning errors persist in ways that erode trust for everyday users.
Meanwhile, GitHub has frustrated a vocal portion of its developer community by changing how issue links behave: they now open in a popup rather than navigating to the full page. It sounds minor, but developers who live inside GitHub all day are pushing back hard, with dozens weighing in on the community discussion thread. Sometimes the smallest UX decisions carry the loudest reactions.
Keep surfing. Tech Beat out.
Tech Beat · 6 PM Update