For a while now, reinforcement learning has been the standard way to fine-tune LLMs. It works in the action space, i.e. exploring alternative actions to maximize reward, which becomes difficult, especially with long action sequences. If we could explore LLM parameters directly, it might be possible to find more systematic and creative changes.
Indeed, parameter-space fine-tuning is possible, and in a surprising way: evolution strategies, i.e. population-based search, can be scaled up to optimize billions of parameters in LLMs. For example, it outperforms PPO and GRPO on a symbolic reasoning task, while being more consistent and sample-efficient. Its gradient-free nature makes it a promising alternative in environments where RL is fragile or costly to apply.
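To make the idea concrete, here is a minimal NumPy sketch of the general evolution-strategies recipe (perturb the parameter vector with Gaussian noise, score each candidate with a reward function, and move toward the reward-weighted average of perturbations). The reward function, hyperparameters, and rank normalization below are illustrative assumptions, not the paper's exact algorithm or hyperparameters.

```python
import numpy as np

def es_fine_tune(theta, reward_fn, pop_size=30, sigma=0.02, lr=0.01, iters=100):
    """Gradient-free ES loop over a flat parameter vector `theta`.

    `reward_fn` is a hypothetical black-box evaluator: it takes a perturbed
    parameter vector (e.g. loaded into an LLM) and returns a scalar reward.
    """
    for _ in range(iters):
        # Sample a population of Gaussian perturbations of the parameters.
        eps = np.random.randn(pop_size, theta.size)
        # Evaluate each perturbed candidate with the black-box reward.
        rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
        # Rank-normalize rewards so the update is robust to reward scale.
        ranks = rewards.argsort().argsort() / (pop_size - 1) - 0.5
        # Estimate the search gradient and take a step in parameter space.
        theta = theta + lr / (pop_size * sigma) * (eps.T @ ranks)
    return theta
```

Because only scalar rewards are exchanged, the per-candidate evaluations can be distributed across workers, which is what makes scaling to billions of parameters plausible.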
Read the blog (with a video and a conceptual illustration):
https://www.cognizant.com/us/en/ai-lab/blog/llm-fine-tuning-with-es

Read the full paper:
https://arxiv.org/abs/2509.24372

Explore the code:
https://github.com/VsonicV/es-fine-tuning-paper