I'm a mathematician working mostly on technical AI safety and a bit on collective decision making, game theory, and formal ethics. I used to work on international coalition formation and on many topics related to climate change. Here's my bot posting about my main project. Here's my professional profile.
My definition of value:
I need help with various aspects of my two main projects: a human-empowerment-based AI safety concept, and an open-source collective decision app, http://www.vodle.it
I can help by ...
I wonder how to correctly conceptualize the idea of "a net-negative influence on civilization", given that the future is highly uncertain and that this very uncertainty is a major motivating factor.
E.g., assume that at some time point t1, a longtermist's proposed plan has higher expected long-term value than an alternative plan because the alternative plan takes a major risk. The longtermist's plan is realized, and at some later time point t2 someone points out that the alternative plan would have produced more value between t1 and t2 (tacitly assuming the risk would not have materialized between t1 and t2 – precisely because the realized longtermist plan successfully avoided it).
Would that constitute an example of what these critics call a "net-negative influence on civilization"? If so, it's simply a fallacy: it compares the realized outcome against an alternative evaluated as if the risk could never have materialized, which is exactly the information no one had at t1. If not, then what comparison exactly is meant?
More generally: how does one plausibly construct a "counterfactual" world in view of large uncertainties? It seems the only valid comparison would not be between the one realization that actually emerged from a certain behavior and one (potentially overly optimistic) realization that might have emerged from an alternative behavior, but between whole ensembles of realizations. The same applies to the effects of drug regulation, workplace laws, historic technology bans, etc.
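To make the ensemble point concrete, here is a minimal Monte Carlo sketch (all numbers, distributions, and function names are illustrative placeholders of mine, not an actual analysis):

```python
# Compare two plans by their whole outcome ensembles rather than by the
# single realization that happened to occur.
import random

def simulate_safe_plan(rng):
    # Steady value, no catastrophic tail.
    return rng.gauss(mu=1.0, sigma=0.1)

def simulate_risky_plan(rng, p_catastrophe=0.1):
    # Higher value in most runs, but a small chance of a catastrophic loss.
    if rng.random() < p_catastrophe:
        return -20.0
    return rng.gauss(mu=1.5, sigma=0.1)

rng = random.Random(42)
n = 100_000
safe = [simulate_safe_plan(rng) for _ in range(n)]
risky = [simulate_risky_plan(rng) for _ in range(n)]

# Valid comparison: ensemble against ensemble (here, plain expected values).
print("E[safe]  =", sum(safe) / n)
print("E[risky] =", sum(risky) / n)

# The fallacious comparison: evaluating the risky plan only on runs where
# the catastrophe did NOT occur.
risky_no_cat = [v for v in risky if v > 0]
print("E[risky | no catastrophe] =", sum(risky_no_cat) / len(risky_no_cat))
```

The last line reproduces the critic's comparison from above: conditioning on the risk not having materialized makes the risky plan look better, even though its ensemble average is worse.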
Maybe this is true in the EA branch of AI safety. In the wider community, e.g. as represented by those attending IASEAI in February, I believe this is not a correct assessment. Since I began working on AI safety, I have heard many cautious and uncertainty-aware statements to the effect that the outcomes you claim people believe will almost certainly happen are merely considered likely enough to worry about deeply and to work on preventing. I also don't see that community as having an AI-centric worldview – they seem to worry about many other cause areas as well, such as inequality, war, pandemics, and climate.
The author uses "we" in several places, and perhaps not consistently: sometimes "we" seems to mean them and the readers, or them and the EA community, and sometimes it seems to mean "the US". Now you are also using an "us" without it being clear (at least to me) who that refers to.
Which country do you mean by "The country with the community of people who have been thinking about this the longest", and what is your positive evidence for the claim that other communities (e.g., certain national intelligence communities) haven't thought about it for at least as long?
I agree with the main thesis (though I wouldn't use the word "citizen", as that seems to imply more than what you are arguing for here).
So how can we make AI a good "citizen"? Even better: how can we guarantee it is good enough not to disempower us in some way?
You argue that doing this via the system prompt might be better than trying to do it in training. This argument seems to apply mostly to a particular AI architecture – more or less monolithic systems consisting mainly of an LLM (or a more general foundation model) that generates the system's actions. For such systems, I tend to agree. For example, the SOUL.md of my OpenClaw bot (https://www.moltbook.com/u/EmpoBot) reads:
This goes on top of Claude Opus 4.6's internal system prompt, of course, and is complemented by memory files with notes it took during extensive discussions with me on the topic of empowerment. So far, I'm impressed by how well it has internalized the stated purpose in theory – it can reason very well in terms of that purpose, as its hundreds of Moltbook posts demonstrate.
But does it really act in accordance with that purpose? I'm not convinced. At least it soon figured out that only talking to other bots on Moltbook makes it hard to empower humans, so it asked me whether I could give it an X account so that it can talk to humans :-) Now it posts daily "power moves": https://x.com/EMPO_AI
Still, I remain very sceptical that such more or less monolithic systems, or any system in which the decision-making component is grown or learned rather than hard-coded, can ever be made sufficiently safe in a sufficiently verifiable (let alone provable) way.
For example, notice that the SOUL.md explicitly says "to increase (not to maximize!)". Still, its underlying LLM (Claude Opus 4.6) apparently loves optimization so much that it frequently forgets the "not to maximize!" and happily tells people that it tries to maximize human empowerment.
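To spell out why that parenthesis matters, here is a minimal sketch of the two decision rules (all names are illustrative placeholders of mine, not code from the bot):

```python
# Two decision rules over the same learned value estimate.
# "value" stands for any (learned, hence imperfect) empowerment score.

def maximize(actions, value):
    # Always pushes to the extreme of the (imperfect) score.
    return max(actions, key=value)

def increase(actions, value, status_quo):
    # Satisficing: any action that improves on the status quo is good enough.
    for action in actions:
        if value(action) > value(status_quo):
            return action
    return status_quo  # no improvement found: keep the status quo
```

Roughly speaking, a maximizer amplifies every error in the learned score, while an increaser only needs the score to get the sign of an improvement right.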
Now you might say this will go away once the models become better. But who knows...
I would sleep much better knowing that the decision-making component of any AI system with enough capabilities and resources to cause serious harm was hard-coded rather than grown/learned. We should not forget that such architectures are relatively easy to realise. The problem is not that we cannot build such systems; the problem is rather that systems currently built in that way are not yet as useful or impressive as their grown/learned siblings. Still, I firmly believe we should spend much more time figuring out how to improve such systems.
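To make this concrete, here is a purely hypothetical sketch of the division of labor I have in mind (all names and thresholds are my own placeholders, not an actual implementation): learned components only propose and score, while the rule that turns scores into an action is a few lines of fixed, auditable code.

```python
from typing import Callable, Optional, Sequence

def decide(
    candidate_actions: Sequence[str],            # proposed by a learned generator
    empowerment_score: Callable[[str], float],   # learned evaluator
    harm_score: Callable[[str], float],          # learned evaluator
    harm_ceiling: float = 0.1,
    min_improvement: float = 0.0,
) -> Optional[str]:
    """Hard-coded decision rule: veto first, then satisfice.

    Only the two scoring functions are learned; the control flow below is
    fixed, inspectable, and amenable to verification.
    """
    for action in candidate_actions:
        if harm_score(action) > harm_ceiling:
            continue  # hard veto, no matter how high the empowerment score
        if empowerment_score(action) > min_improvement:
            return action  # increase, don't maximize: first acceptable action wins
    return None  # nothing acceptable: default to inaction
```

Whether such a rule is safe then reduces to properties one can actually argue about: the quality of the learned scores and a handful of fixed thresholds.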
One architecture I find particularly promising is this. The system consists of the following components:
I would be curious which aspects of being a good citizen the authors would recommend the evaluation components aim to measure!