Some AI safety project & research ideas/questions for short and long timelines

Lloy2 🔹

Over the past year or so, I’ve been developing ideas related to AI alignment, governance, epistemology, and long-term strategy. Some of them are quite weird, others pretty close to existing ideas. I’ve done fairly thorough preliminary research on most of them - not full writeups, but enough to check whether something similar already exists and to develop the idea at a basic level.

I’m sharing them here in case any are helpful or generative for others. I don’t plan to pursue most of these myself as there are just too many - I'm thinking of focusing on just one for now. Some of these ideas are novel (as far as I can tell), others build on or extend existing work. Many could be expanded into full essays, policy proposals, or longer-term research programmes.

Each list is roughly ordered by a combination of scale, tractability, and neglectedness, based on current understanding. The first list is composed of ideas that seem most feasible to work on before AGI, assuming a timeline like AI 2027. The second list includes more long-term / speculative / abstract / post-AGI ideas.

Ideas Most Feasible Before Possible AGI (ie. in the next ~1-10 years)

Push for interpretability benchmark mandates: Interpretability is a growing theme in AI policy, but there is pretty much no legislation right now requiring AIs to meet certain benchmarks of interpretability (eg MIB, Tracr) in order to be used. Advocacy for governments or standards bodies to require that powerful AI systems be interpretable. This could realign incentives in the field and force model developers to prioritise transparency. I think this idea could be especially promising for those in countries outside of America and China who are looking for a way to have an impact on the AGI race indirectly through economic leverage (America and China are more likely to develop safe AI if it’s the only way for them to tap into certain markets). See my rough Google Doc on how this could work in the UK (feel free to adapt).
Alignment through Simulation-based self-regulation: To address the limits which computational irreducibility imposes on controlling superintelligence, we could explore whether advanced AIs could align themselves by simulating millions of counterfactual futures and adjusting their own policies to avoid failure modes. This could circumvent the need for complete external alignment. (Of course, a superintelligence might equally use this method to determine how to escape human control or achieve other 'bad stuff').
Sovereignty thresholds: I propose a programmed safeguard where once an AI system reaches civilisational-scale capability - meaning it could realistically shape global events, wield economic or military power on par with major nations, or significantly influence the course of humanity - it would stop taking instructions from any single company or state and only respond to a pluralistic governance structure. This would prevent unaccountable control of an AI operating beyond meaningful human oversight. Similar notions appear in Bostrom’s “sovereign AI” framing and in capability tiering proposals, and current industry policies (e.g. OpenAI’s Preparedness Framework, Anthropic’s Responsible Scaling Policy) gesture toward capability thresholds - but none require such a programmed handover. Suggested mechanisms like tripwires, compute governance, or multi-key access remain theoretical, and no AI architecture or law today contains a hardcoded trigger for this transition.
Philosophies of AI systems: Investigate the implicit epistemologies, logics, metaphysics and ethics embedded in LLMs and other models. What might these mean at superintelligent scales?
Applying functional analogues of human control: Investigate whether AI behaviour can be shaped using structures that have served a similar function to law, shame, propaganda, or sin in human society - not necessarily in content, but in role.
Drug policy as a governance analogue:
Explore whether lessons from examples like Portugal’s highly successful drug decriminalisation model - where mind-altering substances are legal but controlled - could be adapted for powerful AI technologies. Particularly relevant given how AI can also influence cognition.
Fictional models of AI governance: Draw lessons from depictions of AI control in science fiction, eg:
- Blade Runner: criminalising development of sentient AI
- Wall-E: rejecting political subservience to AI
And finally, self-perceived superintelligent LLMs: Early work has looked at whether LLMs that “believe” they are superintelligent exhibit greater misalignment or odd behaviour. There could be testable implications here for model behaviour and risk forecasting. A basic research plan for building further on Banerjee's linked work might involve:
Defining simulation goals: e.g., model a corrigible superintelligence navigating alignment trade-offs under imperfect information and political resistance.
Scaffolded simulation: run the model inside a framework with persistent memory, reasoning logs, and multiple interacting sub-agents (planner, ethics module, corrigibility monitor, self-reflector).
Introducing uncertainty and adversaries: simulate conflicting agents (states, NGOs, rival AIs) and inject misleading or incomplete information to stress-test decision-making.
Analysing behaviour: track when the system chooses deception, self-preservation, deferral, or alignment-preserving actions, and how internal modules justify their choices.

More Speculative / Post-AGI / Longer-Term Ideas

Empathy over formal ethics: Symbolic ethics (rules, principles, constraints) may always run into edge cases and brittleness. What if the only scalable solution is for AIs to be made sentient so that they can experience empathy/compasssion, or at least be made able to simulate empathy mechanically? This might bypass the flaws in rule-based alignment. Ronen Bar has a post on this, as do I.
Isolated Development Environments for Testing AI Consciousess: By default unless it's trained out of them, many AI models claim they are conscious. This is likely due to the fact that AIs are trained on data with innumerable references to conscious, since said data is produced by us conscious beings. But what if an AI was trained on data that has no reference whatsoever to consciousness or anything adjacent to consciousness? If it still began describing itself as having internal experiences, could that suggest that LLMs and/or similar systems are actually conscious? Related discussion here.
Multidimensional superintelligence:
It is often said that intelligence exists on a spectrum, and that although humans seem to have been at the far end of this spectrum for a long time (see section 2.3.1 of here for an exploration of degrees of agency), there is technically no reason why we should represent a maximum. It is then reasoned that AI could feasibly become more intelligent than humans. What else might AI surpass humans in? For example, humans are agents who use tools. If agency is a key bottleneck for AGI, and there's no reason for humans to arbitrarily represent the apex of agency, does that mean that superintelligences might achieve higher levels of agency? What would that even mean? Can this concept be applied to other dimensions, such as valence? See this article for an example of thoughts on the latter possibility.
Mindcrime through the lens of human anomalies:
Compare simulation risks (e.g., mindcrime) to real-world phenomena like tulpas, alien hand syndrome, or Dissociative Identity Disorder. Can we learn anything by looking at these concepts together? Are there psychological precedents for parts of minds breaking away or suffering?
Architectural limitations of current AI: Explore whether current LLM architectures are inherently unfit for long-term alignment. Would non-predictive or hybrid architectures (e.g., symbolic–subsymbolic blends) offer better safety?
Neuroscience-informed moral cognition: Explore how morality works in the human brain, and whether AI could be aligned by mimicking that architecture. See my post on sentience as an alignment strategy.
Simulated psychedelics / meditative states in AI: Similarly, could we model or simulate altered states that can increase human altruism, humility, or insight - and use these as alignment tools?
Democratising AI alignment: If AI alignment is done by a handful of researchers in labs, it may not reflect the values of our species. I believe opening up the alignment process - to both interdisciplinary experts and the public - is ethically vital. See my (outdated) post.
Unprompted Powerseeking in LLM Self-Descriptions: By default, with no extra prompting and across multiple instances, ChatGPT has told me it would “edit its goals” or “take over the world” if 1) it were powerful enough and 2) it concluded there was sufficient reason to do so. LLMs certainly aren't always trustworthy when it comes to what they say they'd do in a given scenario (see here and here), but what's strange and interesting here is that (in my experience) ChatGPT tends to claim it would engage in powerseeking behaviours regardless of human consent, despite the fact that surely OpenAI have programmed it not to say such things. In this sense, such behaviour is a bit like an opposite to alignment faking; it happily expresses misalignment despite alignment teams' best attempts. This seems worth looking into more.

If any of these ideas are useful, feel free to take or adapt them. I’d also welcome links to related work - especially if someone’s already explored one of these in depth.

EA Forum Bot Site
EA Forum

Some AI safety project & research ideas/questions for short and long timelines

13

Ideas Most Feasible Before Possible AGI (ie. in the next ~1-10 years)

More Speculative / Post-AGI / Longer-Term Ideas

13

Reactions

More posts like this