[Manually cross-posted to LessWrong here.]
There are some great collections of examples of things like specification gaming, goal misgeneralization, and AI improving AI. But almost all of the examples come from demos or toy environments, rather than from systems that were actually deployed in the real world.
There are also some databases of AI incidents which include lots of real-world examples, but the incidents aren't categorized by failure mode in a way that makes it easy to map them onto AI risk claims. (Probably most of them don't map onto those claims in any case, but I'd guess some do.)
I think collecting real-world examples (particularly in a nuanced way, without claiming too much on the basis of them) could be pretty valuable:
- I think it's good practice to have a transparent overview of the current state of evidence
- For many people, I think real-world examples will be the most convincing kind of evidence
- I expect there to be more and more real-world examples, so starting to collect them now seems good
What are the strongest real-world examples of AI systems doing things which, if they scaled up, would bear on AI risk claims?
I'm particularly interested in whether there are any good real-world examples of:
- Goal misgeneralization
- Deceptive alignment (my tentative answer: no, though perhaps yes for simple deception?)
- Specification gaming
- Power-seeking
- Self-preservation
- Self-improvement
This feeds into a project I'm working on with AI Impacts, collecting empirical evidence on various AI risk claims. There's a work-in-progress table here with the main things I'm tracking so far; additions and comments are very welcome.
Maybe it doesn't quite answer your question, but Bing was an example of how the profit incentive makes AI arms races likely. I thought it was worth mentioning because that's part of the story.
The only part of the Bing story that really rattled me is the time the patched version looked itself up through Bing Search, saw that the previous version, Sydney, was a psycho, and started acting up again. "Search is sufficient for hyperstition."