[Manually cross-posted to LessWrong here.]
There are some great collections of examples of things like specification gaming, goal misgeneralization, and AI improving AI. But almost all of the examples are from demos/toy environments, rather than systems which were actually deployed in the world.
There are also some databases of AI incidents which include lots of real-world examples, but the examples aren't tied to failure modes in a way that makes it easy to map them onto AI risk claims. (Most of them probably don't map onto AI risk claims in any case, but I'd guess some do.)
I think collecting real-world examples (particularly in a nuanced way, without claiming too much on the basis of the examples) could be pretty valuable:
- I think it's good practice to have a transparent overview of the current state of evidence
- For many people I think real-world examples will be most convincing
- I expect there to be more and more real-world examples, so starting to collect them now seems good
What are the strongest real-world examples of AI systems doing things which might scale to AI risk claims?
I'm particularly interested in whether there are any good real-world examples of:
- Goal misgeneralization
- Deceptive alignment (answer: no, but yes to simple deception?)
- Specification gaming
- Power-seeking
- Self-preservation
- Self-improvement
This feeds into a project I'm working on with AI Impacts, collecting empirical evidence on various AI risk claims. There's a work-in-progress table here with the main things I'm tracking so far - additions and comments very welcome.
Buckman's examples are not central to what you want, but they are worth reading: https://jacobbuckman.com/2022-09-07-recursively-self-improving-ai-is-already-here/