Replicating "Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models"
April 16, 2025
A replication of a paper examining how adding noise to model weights can reveal hidden capabilities in language models that are sandbagging.
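The core idea being replicated is to perturb a model's weights with small amounts of Gaussian noise and re-run the evaluation; a sandbagging model's accuracy can actually rise at intermediate noise levels. A minimal sketch of that perturbation step is below; the model name and noise scale are illustrative assumptions, not details from the post.

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative model and noise scale; the paper sweeps over noise levels.
model = AutoModelForCausalLM.from_pretrained("gpt2")
sigma = 0.01  # standard deviation of the injected Gaussian noise

with torch.no_grad():
    for param in model.parameters():
        # Add element-wise Gaussian noise to every parameter tensor.
        param.add_(torch.randn_like(param) * sigma)

# Re-run the benchmark on the perturbed model and compare accuracy
# against the unperturbed baseline.
```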
I'm currently working on AI safety. I'm excited about a number of research directions and am experimenting with projects in AI control evaluations, encoded reasoning in chain of thought, and model organisms of misalignment.
Previously, I spent two years as a trading engineer at ExodusPoint Capital, where I worked on a small team developing new automated strategies and building the firm's trading infrastructure. I got my Master's and Bachelor's in Computer Science from Georgia Tech, specializing in Computing Systems.
I'm happy to talk about any of this or anything else. I can be reached at keshavsy[at]gmail[dot]com or on LinkedIn.
Below are some of the projects I've been working on recently. The posts may not be fully up to date, but I'm happy to chat and provide updates on any of them.
April 5, 2025
A preliminary control evaluation for document summarization tasks. Can an untrusted model lie about the results of a paper while evading detection?
March 12, 2025
A replication of a paper examining the ability of frontier AI models to deliberately underperform when incentivized.