Honours Thesis — Class of 2026
Proud to share the work of my undergraduate honours thesis students from the Fall 2025–Winter 2026 cohort. Each of them completed an independent research project, wrote a full thesis, and presented their findings. It was a great group to work with.
Alex Lowe
A 2D-Gaming Benchmark for Deep Reinforcement Learning
Can DRL realistically serve as a playtesting tool for indie developers? Alex built a benchmark of three games (Flappy Bird, Pong, Snake) from scratch and trained four algorithms (PPO, A2C, TRPO, RPPO) on a consumer machine without a GPU. The results showed that while trained agents can serve as lightweight balance measurement tools, reward engineering remains the dominant barrier to adoption for small studios.
Thesis · Slides
Rabia Chattha
Evaluating Large Language Models for Docstring-to-Code Alignment
How well can LLMs associate documentation with the exact code it describes? Rabia framed this as a structured matching task — replacing docstrings with slot identifiers that models must correctly assign from a candidate list. Evaluating several modern LLMs on real-world repositories, she found that while some models perform strongly, inconsistencies in completeness and reliability remain a challenge.
Thesis
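As a rough illustration of the slot-matching setup in Rabia's project, the sketch below masks each docstring in a Python module with a slot identifier and scores a slot-to-docstring assignment. It is a minimal sketch under stated assumptions, not the thesis's actual pipeline: the names `mask_docstrings` and `score_assignment`, the `SLOT_n` naming scheme, and exact-match scoring are all hypothetical.

```python
import ast


def mask_docstrings(source: str) -> tuple[str, dict[str, str]]:
    """Replace each function/class docstring with a slot ID.

    Returns the masked source and the answer key (slot ID -> original docstring).
    """
    tree = ast.parse(source)
    answers: dict[str, str] = {}
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        doc = ast.get_docstring(node)
        if doc is None:
            continue
        slot = f"SLOT_{len(answers) + 1}"  # hypothetical naming scheme
        answers[slot] = doc
        # A docstring is the first statement of the body; swap its text for the slot ID.
        node.body[0].value = ast.Constant(value=slot)
    return ast.unparse(tree), answers


def score_assignment(predicted: dict[str, str], answers: dict[str, str]) -> float:
    """Fraction of slots assigned their original docstring (assumed exact-match metric)."""
    correct = sum(predicted.get(slot) == doc for slot, doc in answers.items())
    return correct / len(answers)


EXAMPLE = '''
def add(a, b):
    """Add two numbers."""
    return a + b

def sub(a, b):
    """Subtract b from a."""
    return a - b
'''

masked, answers = mask_docstrings(EXAMPLE)
candidates = sorted(answers.values())  # the list a model would choose from
```

A model would be shown `masked` and `candidates` and asked to return a slot-to-docstring mapping, which `score_assignment` compares against `answers`. A slot left unassigned counts as wrong, which is one simple way to penalise incomplete outputs.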
Ryan Ahlborn
Context-Aware Hint Generation for Serious Games Using Large Language Models in Gidget
Traditional hint systems in educational games rely on manually written rules that are hard to scale. Ryan built an LLM-powered hint pipeline for Gidget, a programming education game, that reads the current game state and the player's attempt and delivers hints directly inside the game interface. User testing across 88 hint requests showed that 80.7% of hints were immediately helpful and 93.2% were factually correct.
Thesis · Slides
Saahir Dhani
Persona-Based Reward Shaping in Deep Reinforcement Learning
Is reward design alone enough to produce agents with distinct, stable personalities? Saahir introduced three personas (Speedrunner, Survivor, and Greedy) in a custom Breakout environment extended with lives, energy, and special abilities. Each persona was trained with PPO, and the only thing that differed between them was the reward function. The results showed clear and consistent behavioral separation across all five measured metrics. A notable finding: the Survivor agent learned to hack its reward by never launching the ball, so that it could never lose a life. This had to be corrected with an auto-launch constraint.
Thesis · Slides
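To make the idea of persona-specific rewards concrete, here is a minimal sketch in which the three personas share one reward formula and differ only in their weights. Everything in it is an assumption for illustration: the `StepEvents` fields, the `PERSONA_WEIGHTS` values, and the per-step term are hypothetical and are not taken from Saahir's thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepEvents:
    """What happened during one environment step (hypothetical fields)."""
    bricks_broken: int
    life_lost: bool
    level_cleared: bool


# Hypothetical weights: same formula for every persona, different priorities.
PERSONA_WEIGHTS = {
    "speedrunner": {"brick": 1.0, "life_lost": -1.0, "step": -0.05, "clear": 20.0},
    "survivor":    {"brick": 0.5, "life_lost": -10.0, "step": 0.01, "clear": 5.0},
    "greedy":      {"brick": 3.0, "life_lost": -1.0, "step": 0.0, "clear": 5.0},
}


def persona_reward(persona: str, events: StepEvents) -> float:
    """Weighted sum of the step's events under the given persona's weights."""
    w = PERSONA_WEIGHTS[persona]
    return (
        w["brick"] * events.bricks_broken
        + w["life_lost"] * events.life_lost      # bool counts as 0 or 1
        + w["step"]                              # applied on every step
        + w["clear"] * events.level_cleared
    )


# A step on which nothing happens, e.g. the ball has not been launched yet.
idle = StepEvents(bricks_broken=0, life_lost=False, level_cleared=False)
idle_rewards = {name: persona_reward(name, idle) for name in PERSONA_WEIGHTS}
```

With these made-up weights, an idle step is slightly positive for the survivor and negative for the speedrunner. That shows how a reward dominated by life-loss penalties can make doing nothing attractive, which is the kind of loophole an auto-launch constraint closes.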
Saffron Birch
Can LLM-Driven NPCs Remain in Character? Investigating Challenges in Generative AI-based NPC Development and Testing
LLM-powered NPCs can generate dynamic dialogue, but they struggle with hallucination, personality drift, and adversarial prompts. Saffron built a structured evaluation framework combining a cognitive memory system with a persona monitoring layer across four guardrail dimensions: personality alignment, knowledge filtration, bias mitigation, and narrative adherence. Tested on an NPC configured as Geralt of Rivia from The Witcher 3, the guardrail system raised the all-pass rate from 67.6% to 83.8% on a 37-question adversarial test suite.
Thesis · Slides
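The all-pass rate reported for Saffron's test suite can be read as the share of responses that pass every guardrail dimension at once. The sketch below computes it under that assumption, which is an interpretation and not a definition given on this page. The dictionary keys and the toy results are hypothetical and are not data from the thesis.

```python
# The four guardrail dimensions named in the summary, as hypothetical dict keys.
DIMENSIONS = (
    "personality_alignment",
    "knowledge_filtration",
    "bias_mitigation",
    "narrative_adherence",
)


def all_pass_rate(results: list[dict[str, bool]]) -> float:
    """Fraction of responses that pass all four dimensions (assumed definition)."""
    if not results:
        raise ValueError("results must not be empty")
    passed = sum(all(r[d] for d in DIMENSIONS) for r in results)
    return passed / len(results)


# Toy results for four NPC responses; invented purely for illustration.
toy_results = [
    dict.fromkeys(DIMENSIONS, True),
    dict.fromkeys(DIMENSIONS, True),
    {**dict.fromkeys(DIMENSIONS, True), "knowledge_filtration": False},
    dict.fromkeys(DIMENSIONS, True),
]

rate = all_pass_rate(toy_results)
```

Because a single failed dimension fails the whole response, this metric is stricter than averaging per-dimension pass rates.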
Saksham Tejpal
A Large Dataset of Video Game Patch Notes
Patch notes are a rich but messy record of how games evolve — Steam mixes them with sales posts, community updates, and promotional content, making large-scale study impractical. Saksham built a four-stage pipeline that retrieves, filters, classifies, and serves patch notes at scale: starting from 145,622 games and 1.87 million news items, the pipeline uses Gemini 2.5 Flash to confirm 817,765 genuine patch notes across 68,043 games. It then extracts 8.3 million individual change statements tagged as bug fixes (36.2%), feature additions (45.8%), or balance changes (18.0%), served through a searchable web application.
Thesis · Slides