Alibaba Qwen QwQ-32B: Scaled reinforcement learning showcase

The Qwen team at Alibaba has unveiled QwQ-32B, a 32 billion parameter AI model that demonstrates performance rivalling the much larger DeepSeek-R1. This breakthrough highlights the potential of scaling Reinforcement Learning (RL) on robust foundation models.
The team has also integrated agent capabilities into the reasoning model, enabling it to think critically, utilise tools, and adapt its reasoning based on environmental feedback.
“Scaling RL has the potential to enhance model performance beyond conventional pretraining and post-training methods,” the team stated. “Recent studies have demonstrated that RL can significantly improve the reasoning capabilities of models.”
QwQ-32B achieves performance comparable to DeepSeek-R1, which has 671 billion parameters (37 billion activated per token), underscoring the ability of RL to narrow the gap between model size and performance when applied to robust foundation models pretrained on extensive world knowledge.
The model has been evaluated across a range of benchmarks, including AIME24, LiveCodeBench, LiveBench, IFEval, and BFCL, designed to assess its mathematical reasoning, coding proficiency, and general problem-solving capabilities.
The results highlight QwQ-32B’s performance in comparison to other leading models, including DeepSeek-R1-Distilled-Qwen-32B, DeepSeek-R1-Distilled-Llama-70B, o1-mini, and the original DeepSeek-R1.
Benchmark results:
- AIME24: QwQ-32B achieved 79.5, just behind DeepSeek-R1-671B's 79.8 but well ahead of OpenAI's o1-mini at 63.6 and both distilled models.
- LiveCodeBench: QwQ-32B scored 63.4, slightly behind DeepSeek-R1-671B's 65.9 while surpassing the distilled models and o1-mini's 53.8.
- LiveBench: QwQ-32B achieved 73.1, edging out DeepSeek-R1-671B's 71.6 and outperforming the distilled models and o1-mini's 57.5.
- IFEval: QwQ-32B scored 83.9, just ahead of DeepSeek-R1-671B's 83.3 and leading the distilled models and o1-mini's 59.1.
- BFCL: QwQ-32B achieved 66.4, beating DeepSeek-R1-671B's 62.8 and demonstrating a clear lead over the distilled models and o1-mini's 49.3.
The Qwen team’s approach involved a cold-start checkpoint and a multi-stage RL process driven by outcome-based rewards. The initial stage focused on scaling RL for math and coding tasks, utilising accuracy verifiers and code execution servers. The second stage expanded to general capabilities, incorporating rewards from general reward models and rule-based verifiers.
“We find that this stage of RL training with a small amount of steps can increase the performance of other general capabilities, such as instruction following, alignment with human preference, and agent performance, without significant performance drop in math and coding,” the team explained.
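The Qwen team has not published the reward machinery itself, but the idea of outcome-based rewards can be illustrated with a minimal Python sketch. The function names (`math_reward`, `code_reward`) are hypothetical stand-ins for the accuracy verifiers and code execution servers the team describes, not their actual implementation:

```python
import subprocess
import sys
import tempfile

# Hypothetical verifiers illustrating outcome-based rewards: the
# signal depends only on whether the final result is correct, not
# on how the model reasoned its way there.

def math_reward(model_answer: str, reference_answer: str) -> float:
    """Accuracy verifier: binary reward for a matching final answer.
    A production checker would also normalise equivalent expressions."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def code_reward(generated_code: str, test_suite: str) -> float:
    """Code-execution verifier: run the generated code against test
    cases in a subprocess and reward only a fully passing run."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_suite)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=30
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
```

The defining property is that the reward comes from a verified outcome (a checked answer or a passing test run) rather than a learned preference score, which is what distinguishes this first stage from the general reward models introduced in the second.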
QwQ-32B is open-weight and available on Hugging Face and ModelScope under the Apache 2.0 license, and is also accessible via Qwen Chat. The Qwen team views this as an initial step in scaling RL to enhance reasoning capabilities and aims to further explore the integration of agents with RL for long-horizon reasoning.
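Since the weights are published under Apache 2.0, the model can be loaded with the standard Hugging Face `transformers` API. A minimal sketch, assuming the `Qwen/QwQ-32B` repository ID and sufficient GPU memory for a 32B-parameter model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B"  # repository ID on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # spread layers across available GPUs
)

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Reasoning models of this kind typically emit a long chain of thought before the final answer, so a generous `max_new_tokens` budget is advisable.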
“As we work towards developing the next generation of Qwen, we are confident that combining stronger foundation models with RL powered by scaled computational resources will propel us closer to achieving Artificial General Intelligence (AGI),” the team stated.