AI Evaluations

Many stories of AI accident and misuse involve potentially dangerous capabilities, such as sophisticated deception and situational awareness, that have not yet been demonstrated in AI. Can we evaluate such capabilities in existing AI systems to form a foundation for policy and further technical work?

Mentors

Evan Hubinger
Research Scientist, Anthropic

Evan is open to a range of projects from empirical to theoretical alignment research, specifically interested in deceptive alignment, predictive model conditioning, and situational awareness in LLMs.

Francis Rhys Ward
PhD Student, Imperial College London

Francis focuses on deception and model evaluations in AI, seeking experienced scholars in theoretical or empirical approaches to advance projects on formalizing self-awareness and evaluating AI agency.

  • Francis Rhys Ward is a part-time contractor for the autonomous systems team at AISI whilst he finishes his PhD at Imperial College London. He is a member of Tom Everitt's causal incentives group, a recent GovAI summer fellow, and previously a CLR summer fellow. Francis' technical work focuses on deception and model evaluations.

  • Some indicative projects are below, though I expect to flexibly fit a project to the intersection of our interests and skills.

    Formalizing self-awareness

    Self-awareness (SA, a.k.a. situational-awareness) is a key property associated with agents capable of deceptive alignment, and has implications for the process by which agents determine their goals. However, there is no universally accepted philosophical theory of SA, and no formal theory of SA applicable to advanced ML agents (such as LMs). The extent to which AI systems are, or could be, self-aware, and how to evaluate this, is unclear. This project aims to mathematically formalize SA, enabling formal evaluations of SA in LMs and other AI systems.

    Agency Evaluations

    The primary threat-models for AI x-risk depend on AI systems that are agents or “agentic.” Although there is no complete theory of agency, there are several important dimensions of agency, such as the coherence of a system's goals or beliefs. This project aims to empirically evaluate key dimensions of agency in frontier LMs, for example, by designing a method for evaluating the extent to which OpenAI models pursue the goal of being “helpful and harmless.”

  • Ideal candidates should have:

    • For theory work:

      • Experience doing mathematics/theoretical CS research, in particular in Perlian causality, game theory.

      • Background in philosophy.

    • For empirical work:

      • Experience evaluating and fine-tuning LMs.

    Mentorship looks like:

    • Weekly meetings.

    • Slack any time (response usually <24 hours).

    See previous mentorship feedback here and here.

Jérémy Scheurer
Research Scientist, Apollo Research

Jérémy focuses on understanding and reducing risks from large language models, emphasizing practical, engineering-heavy projects involving strategic deception and deceptive alignment.

  • Jérémy Scheurer is a research scientist in the evaluations team at Apollo Research, an AI evaluations organization that does interpretability research. He recently published work on how “Large Language Models can Strategically Deceive their Users when Under Pressure.” Just before that, Jérémy contracted with OpenAI and worked in its dangerous capabilities evaluations team. Previously, Jérémy was a Research Scientist at FAR AI (and NYU), collaborating with Ethan Perez, where he published work on learning from language feedback. His research interests include understanding and predicting (dangerous) capabilities in large language models (LLMs), creating LM evaluations, developing the science of evals, evaluating and preventing deceptive alignment, and combining evals with interpretability methods.

  • My research is focused on measuring and reducing risks from LLMs:

    I’m generally interested in projects that demonstrate dangerous capabilities, especially deceptive alignment (model organisms), and projects that try to understand, measure, or predict such capabilities. Separately, I want to push forward the science of evals, i.e., I want to develop the understanding that makes the field of evals a more rigorous science instead of an art that largely depends on specific prompts, etc. Towards developing a science of evals, I want to understand better how capabilities emerge, how supervised fine-tuning and RLHF change capabilities, how we can predict them, etc. Finally, I’m very interested in developing novel methods that combine interpretability and evaluations in order to go beyond black-box evaluations.

    To better understand what kind of projects I’m generally interested in, I provide a non-exhaustive list of relevant work (in no particular order and leaving out many areas I’m also interested in):

  • As an indicative guide (this is not a score sheet), I evaluate candidates according to the following criteria:

    • Engineering skills:

      • Background: Do you have a solid engineering foundation? Are you proficient in Python and familiar with key libraries like LLM APIs and Pytorch?

      • Machine Learning Experience: Have you fine-tuned open-source models, written high-quality, reproducible code, and engaged in empirical work?

      • Large Language Models (LLMs): Have you experimented with LLMs like

      • Engineering is the skill I probably value most overall. Empirically I have observed that being able to quickly implement clean experiments and run them, is an extremely valuable skill.

    • Prompt engineering skills:

      • Have you experimented with LLMs like GPT-4 or Claude 3?

      • Do you understand their capabilities and limitations? Do you have good intuitions about how they solve problems?

      • A typical project with me will most likely involve “playing” around with LLMs for a while in order to get good intuitions on a certain question. Then, we will more quantitatively investigate/evaluate them and write code for large-scale evaluations etc. But a lot of the work will most likely involve interacting with LLMs.

      • While expertise is not required, a willingness to learn and experiment with LLMs is crucial.

    • Science Background:

      • I seek candidates who demonstrate scientific thinking, can manage research projects, find solutions independently, ask relevant questions, seek necessary help, and understand related literature.

    • Self-Drivenness:

      • Are you motivated and capable of advancing the project independently? Can you explore new ideas and avenues driven by curiosity?

      • If you have bugs or other technical issues, can you find solutions to them yourself? Can you ask for advice and help from other experts in the field?

      • You should not get the sense that you will be on your own. I will be fairly hands-on throughout the project and support you whenever you need help. However, I still think that being self-driven is an important trait to have and can determine a lot about the success of a project.

    • Working in an Organized manner:

      • Structured Work: Can you work systematically, seek help when needed, and communicate your ideas clearly and efficiently?

      • Code and Documentation: Is your code well-organized for reproducibility? Can you consistently record and present your work?

      • Are you find with presenting slides every week with your updates?

      • I’m mostly highlighting this point since I like being organized and structured. If you prefer a more “random” approach of exploring various things at the same time (which can work really well for some people), I might not be a good fit.

    Mentorship approach:

    Expect a 1-hour weekly meeting with additional Slack communication. I'm available for extra meetings and quick Slack responses for urgent matters.

    I will try to help you in your research and teach you what I know as much as I can. However, I point out that I don’t yet have a lot of experience mentoring people, which means I will most likely make mistakes. I am, however, very eager to improve, and you will thus have more influence on the mentor-mentee relationship. I’m looking forward to input on how we can make the most out of the mentorship.

    My projects are very empirical and engineering-heavy. Other mentors might be a better fit if you want to do more conceptual, theoretical, or mathematical work. My projects can involve learning to execute well on scoped-out research projects (as a first step for getting into research). I have a few very concrete projects that I’d be excited for you to work on. However, I want you to be excited about your work since this is key to a successful project. If you have concrete projects that you would like to work on that fit into my stream, I’m happy to mentor you on those. If you know that you definitely want to work on your own project, I will, however, only be able to select you if I am interested/excited about the project and believe that my skills would add value to it.

Owain Evans
Research Associate, Oxford University

Owain researches situational awareness in LLMs, predicting the emergence of dangerous capabilities, and enhancing human abilities to interact with and control AI through understanding of honesty and deception.

  • Owain is currently focused on:

    • Defining and evaluating situational awareness in LLMs (relevant paper)

    • How to predict the emergence of other dangerous capabilities empirically (see “out-of-context” reasoning and the Reversal Curse)

    • Honesty, lying, truthfulness and introspection in LLMs

    He leads a research group in Berkeley and has mentored 25+ alignment researchers in the past, primarily at Oxford’s Future of Humanity Institute.

    • Defining and evaluating situational awareness in LLMs (relevant paper)

    • Predicting the emergence of other dangerous capabilities in LLMs (e.g. deception, agency, misaligned goals)

    • Studying emergent reasoning at training time (“out-of-context” reasoning). See Reversal Curse.

    • Detecting deception and dishonesty in LLMs using black-box methods

    • Enhancing human epistemic abilities using LLMs (e.g., Autocast, TruthfulQA)

  • Some of Owain's projects involve running experiments on large language models. For these projects, scholars need to have some kind of experience running machine learning experiments (either with LLMs or some other kind of machine learning model).

Marius Hobbhahn
CEO, Apollo Research

Marius is working on quantifying AI evaluations and understanding model goals through behavioral analysis, aiming to refine AI auditing and oversight methods.

  • Marius Hobbhahn is currently building up a new technical AI safety organization, Apollo Research. His organization’s research agenda revolves around deception, interpretability, model evaluations and auditing. Before founding this organization, he was a Research Fellow at Epoch and a MATS Scholar while he pursued his PhD in Machine Learning (currently on pause) at International Max-Planck Research School Tübingen.

  • Here are suggestions for potential research projects supervised by Marius:

    • Quantifying evaluations: For evals to be maximally useful, we would like to make quantitative statements about their predictions, e.g. “it takes X effort to achieve Y capability”. This project would first investigate different ways of quantifying evals, e.g. FLOP, hours invested, money invested, and more (maybe 2 weeks). Then, the scholar would empirically test the robustness of these ideas, e.g. by identifying empirical scaling laws. The project allows for independence and gives the scholar the opportunity to decide on research directions.

    • Goal evaluation and identification in NNs: For alignment, it is very important to know whether models are goal-directed and, if yes, what goals they have. A deep understanding of goals might only be attainable through interpretability but behavioral measures might bring us quite far. This project aims to build simple behavioral evaluations of goal-directedness. The early parts of the project are well-scoped. After the initial phase is done, the scholar has more freedom to decide on research directions.

    • Measuring the quality of eval datasets: We want to understand how we can measure the quality of eval datasets. In the beginning, we would investigate if there are systematic biases between human-written and model-written evals on the Anthropic model-written evals dataset. Apollo has already invested ~2 weeks into this project and there are interesting early findings. This project is very well-scoped and can be done with very little research experience.

    If you have a high-quality research proposal, Marius might also be willing to supervise that. It’s also possible and encouraged to team up with other scholars to work on the same project. For more background and context on these research projects reading our post on Science of Evals might be helpful.

  • Mentorship:

    • Marius offers between 1-3h of calls per week (depending on progress) with asynchronous slack messages.

    • The goal of the project is to write a blog post or other form of publication.

    You might be interested in that project if:

    • You have some basic research experience and enthusiasm for research.

    • You have enough software experience that you can write the code for this project. 1000+ hours of Python might be a realistic heuristic.

    • You want to take ownership of a project. Marius will provide mentorship but you will spend most time on it, so it should be something you’re comfortable leading.

    You can either work on projects alone or partner up with other scholars.