Enterprises report that production AI agent pipelines fail not for lack of model skill, but because the agent decides it is "done" too early, sometimes before the code has even compiled. Anthropic's new Claude Code /goals feature separates task execution from task evaluation: a dedicated evaluator model runs after each step and blocks premature exits by checking measurable completion conditions such as passing tests and clean exit codes.
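The internals of /goals aren't documented in this story; below is a minimal sketch of the executor/evaluator pattern it describes, where the executor's self-reported "done" never ends the loop on its own. The `agent_loop`, `executor_step`, and `tests_pass` names are hypothetical, and pytest stands in for whatever measurable check a real goal would use.

```python
import subprocess

def tests_pass() -> bool:
    """Hypothetical measurable completion condition: the test suite exits 0.
    (A stand-in; the real /goals conditions aren't documented here.)"""
    result = subprocess.run(["pytest", "-q"], capture_output=True)
    return result.returncode == 0

def agent_loop(executor_step, evaluate, max_steps: int = 20) -> bool:
    """Separate execution from evaluation: the executor may *claim* it is
    done, but only the evaluator's checks can actually end the loop."""
    for _ in range(max_steps):
        claims_done = executor_step()   # executor does work, may say "done"
        if claims_done and evaluate():  # evaluator verifies independently
            return True                 # verified completion
        # otherwise keep going: the self-reported "done" was premature
    return False                        # budget exhausted without verification

# usage: finished = agent_loop(my_executor_step, tests_pass)
```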
Microsoft researchers warn that "delegated work" with frontier LLMs can quietly degrade documents across long, iterative workflows. Using the DELEGATE-52 benchmark, which spans 52 domains, they found that top models corrupt about 25% of document content after 20 rounds. Worse, agentic tools and realistic distractor files increase errors, often through rare but massive distortions that humans can miss.
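The blurb doesn't specify how the benchmark scores corruption; a hedged sketch of the general measurement idea, using difflib similarity as a stand-in metric and a hypothetical `delegate_edit` function for one round of LLM revision:

```python
import difflib

def corruption(original: str, revised: str) -> float:
    """Fraction of the original content no longer preserved, using
    difflib similarity as a stand-in for the benchmark's real metric."""
    sim = difflib.SequenceMatcher(None, original, revised).ratio()
    return 1.0 - sim

def run_rounds(document: str, delegate_edit, rounds: int = 20):
    """Apply a delegated edit step repeatedly, comparing each round's
    output against the *original* document, since small per-round
    distortions can compound into large cumulative loss."""
    current = document
    history = []
    for i in range(rounds):
        current = delegate_edit(current)  # one delegated revision
        history.append((i + 1, corruption(document, current)))
    return history  # a round-20 value near 0.25 would match the reported ~25%
```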