anakin87 11 hours ago

I have been experimenting with GRPO lately, since I am fascinated by models learning from prompts and rewards - no example answers needed, unlike in Supervised Fine-Tuning.

After the DeepSeek boom, everyone is trying GRPO with GSM8K or the Countdown Game, but I wanted a different challenge. So I opted for teaching a model to create a schedule from a list of events and priorities.

Choosing an original problem forced me to think about the problem setting, generate data, choose the base model, design reward functions, and run multiple rounds of training, hoping that my model would learn something.
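To give a sense of what "design reward functions" means in practice: in the style of TRL's GRPOTrainer, a reward function simply receives the generated completions (plus any extra dataset columns) and returns one score per completion. Below is a minimal illustrative sketch of a format-checking reward for a scheduling task - the <schedule> tag format and the scoring values are made up for this example, not the actual rewards from the repo linked below.

    import re

    # Sketch of a GRPO-style reward function: one float per completion.
    # Here we only check that the model wrapped its answer in <schedule> tags.
    def format_reward(completions, **kwargs):
        pattern = r"<schedule>.*?</schedule>"
        rewards = []
        for completion in completions:
            # completions may be plain strings or chat-style message lists
            text = completion if isinstance(completion, str) else completion[0]["content"]
            rewards.append(1.0 if re.search(pattern, text, re.DOTALL) else 0.0)
        return rewards

    # Then, roughly: GRPOTrainer(model=..., reward_funcs=[format_reward], train_dataset=...)

In the real project, rewards also have to score the content of the schedule (respecting the events and priorities), not just the format - that is where most of the design work goes.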

A fun and rewarding experience. :-)

I learned a lot of things that I want to share with you.

Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo

Code: https://github.com/anakin87/qwen-scheduler-grpo

Hugging Face collection (dataset and model): https://huggingface.co/collections/anakin87/qwen-scheduler-g...