AI and the Sunday Puzzle

Researchers have used NPR's Sunday Puzzle to test the reasoning abilities of AI models, and found that some models give up and provide incorrect answers. They hope that an accessible benchmark of this kind will help improve those models.

Every Sunday, Will Shortz, the New York Times crossword editor, hosts the Sunday Puzzle segment on NPR. The segment engages thousands of listeners with riddles that are designed to be solvable without specialized knowledge but are often challenging all the same.

A group of researchers has used these riddles to test the problem-solving abilities of AI, and found that reasoning models such as o1 sometimes 'give up' and provide incorrect answers. Arjun Guha, one of the study's authors, explained that the goal was to develop a benchmark whose problems can be understood with only general knowledge.

The AI industry is in a tricky position when it comes to tests, because many of the commonly used ones have little relevance to the average user. The Sunday Puzzle has the advantage of requiring no esoteric knowledge, and its problems are phrased so that models cannot fall back on 'rote memory'. Guha emphasized that the problems are difficult because it is hard to make meaningful progress until the moment the problem is solved, which takes a combination of insight and a process of elimination. No benchmark is perfect, but the Sunday Puzzle supplies new questions every week, which keeps the test up to date.

Reasoning models such as o1 and DeepSeek's R1 outperform the other models, as they thoroughly fact-check their answers. Even so, R1 has given wrong answers to some riddles and exhibited curious behaviors, such as retracting incorrect answers. Guha found it amusing to see a model express frustration the way a human would.

The best-performing model so far is o1, with a score of 59%, followed by o3-mini at 47%. The researchers plan to extend testing to other reasoning models to identify how they can be improved. Guha concluded that you don't need a PhD to be good at reasoning, and that accessible reasoning benchmarks can lead to better outcomes in the future.