Alibaba researchers say their Metis agent, trained with HDPO reinforcement learning, cuts redundant tool use from 98% to 2% by teaching accuracy and efficiency as separate learning signals. The approach targets “trigger-happy” behavior that slows agents, inflates API costs, and injects noisy context. Metis also reaches top-tier reasoning and visual-document performance across benchmarks.
Swipe through stories, personalise your feed, and save articles for later — all on the app.