Deep Thinking: Reinforcement Learning and Benchmarking for Better LLMs
The 20th edition of Better AI Meetup will be all about fine-tuning and benchmarking LLMs to help people get the most out of them.
LIVESTREAM LINK: coming up
The 20th edition of Better AI Meetup returns, in part, to the topic of the very first meetup in 2021, where the first-ever Slovak language neural model was presented. This time, the focus is on fine-tuning and benchmarking LLMs to help people get the most out of them.
First, Jakub Mačina, an LLM researcher from ETH Zurich, will talk about a post-training alignment method for collaborative agents that helps LLMs co-reason and co-act with humans. He will show how a 7B open model can rival larger proprietary LLMs through thoughtful use of reinforcement learning.
Then Marek Šuppa, a Principal Machine Learning and AI Engineer at Slido, and Andrej Ridzik, a Senior Research Engineer at KInIT, will show us how language models are evaluated to separate real capability from marketing fluff. They will cover the ongoing effort to create benchmarks for the Slovak language, why different models require different benchmarks, and what challenges these benchmarks pose.
SPEAKERS
Jakub Mačina
Researcher at ETH Zurich
Jakub Mačina is a researcher at ETH Zurich working on large language models (LLMs) for reasoning and multi-turn capabilities with applications to education. He earned his PhD in Machine Learning at ETH Zurich as a Fellow of the ETH AI Center and was named to the Forbes 30 Under 30 list in the Science and Education category. Previously, he led a machine learning team at the Slovak startup Exponea (acquired by Bloomreach).
Marek Šuppa
Principal Machine Learning / AI Engineer at Slido
Marek Šuppa is a Principal Machine Learning / AI Engineer at Slido (acquired by Cisco). He leads Slido’s Data team, and before that was one of the early employees of DuckDuckGo, the search engine that does not track you. He also teaches at Comenius University in Bratislava. His work spans natural language processing, machine learning, robotics, and applied AI. He helps organize various events, such as RoboCup, Slovakia’s International Olympiad in AI, AI Build Day, and Bratislava Slush’D.
Andrej Ridzik
Senior Research Engineer at KInIT
Andrej Ridzik is a Senior Research Engineer at the Kempelen Institute of Intelligent Technologies (KInIT), where he works on a portfolio of NLP activities spanning research, commercial projects, and LLM-based solutions for Slovak. He represents KInIT as the lead for benchmarking activities within the Slovak NLP Community, where he co-leads the development of benchmarks for the Slovak language. He has worked in AI and NLP for over a decade, with prior experience as an engineer across industry and research projects.
Join us online for the 20th Better AI Meetup on June 24th!
Good to know
Highlights
- 2 hours
- Online
Location
Online event
Agenda
- Welcome
- Post-training LLMs for Collaboration using Reinforcement Learning
Reinforcement learning (RL) has made LLMs stronger reasoners by scaling test-time compute, but most RL methods optimize for single-turn problem solving. This talk presents a post-training alignment method for collaborative agents: models that co-reason and co-act with humans across multi-turn tutoring and planning settings. It shows how a 7B open model can rival larger proprietary LLMs, and it highlights practical lessons for applying RL across verifiable and less verifiable domains while preserving accuracy and avoiding SFT-style overspecialization.
- Benchmarking Language Models for Slovak: What It Takes and Why It Matters
As language models become central to NLP applications, reliable evaluation is what separates real capability from marketing claims – and building good benchmarks is far from trivial, especially for languages with limited resources like Slovak. This talk presents the ongoing effort to build reliable evaluation benchmarks for Slovak language models. It covers why different model types require different benchmarks, how raw datasets are turned into well-defined evaluation tasks, and the specific challenges of evaluating LLM outputs. It also shares lessons learned from delivering these benchmarks and reflects on the role of cross-team collaboration in scaling such efforts.