Discovering language model behaviors with model-written evaluations E Perez, S Ringer, K Lukošiūtė, K Nguyen, E Chen, S Heiner, C Pettit, ... arXiv preprint arXiv:2212.09251, 2022 | 219 | 2022 |
Risks from learned optimization in advanced machine learning systems E Hubinger, C van Merwijk, V Mikulik, J Skalse, S Garrabrant arXiv preprint arXiv:1906.01820, 2019 | 140 | 2019 |
Studying large language model generalization with influence functions R Grosse, J Bae, C Anil, N Elhage, A Tamkin, A Tajdini, B Steiner, D Li, ... arXiv preprint arXiv:2308.03296, 2023 | 114 | 2023 |
Measuring faithfulness in chain-of-thought reasoning T Lanham, A Chen, A Radhakrishnan, B Steiner, C Denison, ... arXiv preprint arXiv:2307.13702, 2023 | 84 | 2023 |
Many-shot jailbreaking C Anil, E Durmus, N Rimsky, M Sharma, J Benton, S Kundu, J Batson, ... The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024 | 66 | 2024 |
Steering Llama 2 via contrastive activation addition N Panickssery, N Gabrieli, J Schulz, M Tong, E Hubinger, AM Turner arXiv preprint arXiv:2312.06681, 2023 | 57 | 2023 |
Sleeper agents: Training deceptive LLMs that persist through safety training E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ... arXiv preprint arXiv:2401.05566, 2024 | 53 | 2024 |
Question decomposition improves the faithfulness of model-generated reasoning A Radhakrishnan, K Nguyen, A Chen, C Chen, C Denison, D Hernandez, ... arXiv preprint arXiv:2307.11768, 2023 | 45 | 2023 |
An overview of 11 proposals for building safe advanced AI E Hubinger arXiv preprint arXiv:2012.07532, 2020 | 26 | 2020 |
Sycophancy to subterfuge: Investigating reward-tampering in large language models C Denison, M MacDiarmid, F Barez, D Duvenaud, S Kravec, S Marks, ... arXiv preprint arXiv:2406.10162, 2024 | 10 | 2024 |
Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant O Järviniemi, E Hubinger arXiv preprint arXiv:2405.01576, 2024 | 8 | 2024 |
AI safety via market making E Hubinger AI Alignment Forum, 2020 | 8 | 2020 |
Simple probes can catch sleeper agents M MacDiarmid, T Maxwell, N Schiefer, J Mu, J Kaplan, D Duvenaud, ... URL https://www.anthropic.com/news/probes-catch-sleeper-agents, 2024 | 7 | 2024 |
Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research E Hubinger, N Schiefer, C Denison, E Perez Alignment Forum. URL: https://www.alignmentforum.org/posts …, 2023 | 6 | 2023 |
A transparency and interpretability tech tree E Hubinger Alignment Forum, 2022 | 6 | 2022 |