
How should AI alignment and autonomy preservation intersect in practice?

AI alignment research has made significant progress in embedding internal constraints that prevent models from manipulating, deceiving, or coercing users (to the extent that models actually refrain from doing so). However, internal alignment mechanisms alone don’t necessarily give users meaningful control over an AI’s influence on their decision-making, which is a mechanistic problem in its own right.

This raises a question: Should future AI systems be designed to not only align with human values but also expose their influence in ways that allow users to actively contest and reshape AI-driven inferences?

For example:

If an AI model generates an inference about a user (e.g., “this person prefers risk-averse financial decisions”), should users be able to see, override, or refine that inference?
If an AI assistant subtly nudges users toward certain decisions, should it disclose those nudges in a way that preserves user autonomy?
Could mechanisms like adaptive user interfaces (allowing users to adjust how AI explains itself) or AI-generated critiques of its own outputs serve as tools for reinforcing autonomy rather than eroding it?
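
A minimal Python sketch of what the first two examples could look like as a data structure. Everything here (the `Inference` and `ControlLayer` names, their fields and their methods) is a hypothetical illustration, not an existing API or a settled design. It only shows the shape of the idea: the system records what it inferred and why, the user can refine or reject it, and nudges are logged so they can be disclosed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Inference:
    """One AI-generated inference about a user (hypothetical schema)."""
    claim: str                            # e.g. "prefers risk-averse financial decisions"
    evidence: List[str]                   # what the system says the claim is based on
    user_override: Optional[str] = None   # the user's refined wording, if any
    rejected: bool = False                # True once the user has discarded the claim

    def effective_claim(self) -> Optional[str]:
        """Return the claim downstream components should use, or None if rejected."""
        if self.rejected:
            return None
        return self.user_override if self.user_override is not None else self.claim


@dataclass
class ControlLayer:
    """Assumed user-facing layer that keeps inferences and nudges inspectable."""
    inferences: List[Inference] = field(default_factory=list)
    nudge_log: List[str] = field(default_factory=list)

    def record(self, claim: str, evidence: List[str]) -> Inference:
        """Store a new inference so the user can later see it."""
        inference = Inference(claim=claim, evidence=list(evidence))
        self.inferences.append(inference)
        return inference

    def refine(self, index: int, new_claim: str) -> None:
        """Let the user replace the system's wording with their own."""
        self.inferences[index].user_override = new_claim

    def reject(self, index: int) -> None:
        """Let the user discard an inference entirely."""
        self.inferences[index].rejected = True

    def disclose_nudge(self, description: str) -> None:
        """Log a nudge so it can be shown to the user instead of staying implicit."""
        self.nudge_log.append(description)


if __name__ == "__main__":
    layer = ControlLayer()
    layer.record("prefers risk-averse financial decisions",
                 ["chose bonds over stocks twice"])
    layer.refine(0, "risk-averse for retirement savings only")
    layer.disclose_nudge("ranked index funds first in the results")
    print(layer.inferences[0].effective_claim())
    print(layer.nudge_log)
```

The sketch deliberately leaves open the hard parts raised above: how inferences get extracted from a real model in the first place, and how to present them without overwhelming the user.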

I’m exploring a concept I call Autonomy by Design, a control-layer approach that builds on alignment research but adds external, user-facing mechanisms to make AI’s reasoning and influence more contestable.

I would love to hear from interpretability experts and UX designers: Where do you see the biggest challenges in implementing user-facing autonomy safeguards? Are there existing methodologies that could be adapted for this purpose?

Thank you in advance.

Feel free to shatter this if you must XD.


Katalina Hernandez's Quick takes

by Katalina Hernandez

This is a special post for quick takes by Katalina Hernandez. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.


Katalina Hernandez, 11h ago

AI safety

The AI alignment community had a major victory in the regulatory landscape, and it went unnoticed by many.

The EU AI Act explicitly mentions "alignment with human intent" as a key focus area in relation to regulation of systemic risks.

As far as I know, this is the first time “alignment” has been mentioned in a law or major regulatory text.

It’s buried in Recital 110, but it’s there. And it also makes research on AI Control relevant:

"International approaches have so far identified the need to pay attention to risks from potential intentional misuse or unintended issues of control relating to alignment with human intent".

This means that alignment is now part of the EU’s regulatory vocabulary.

But here’s the issue: most AI governance professionals and policymakers still don’t know what it really means, or how your research connects to it.

I’m trying to build a space where AI Safety and AI Governance communities can actually talk to each other.

If you're curious, I wrote an article about this, aimed at corporate decision-makers who lack literacy in your area.

Would love any feedback, especially from folks thinking about how alignment ideas can scale into the policy domain.

Here is the Substack link (I also posted it on LinkedIn):

https://open.substack.com/pub/katalinahernandez/p/why-should-ai-governance-professionals?utm_source=share&utm_medium=android&r=1j2joa

My intuition says that this was a push from the Future of Life Institute.

Thoughts? Did you know about this already?
