I'm thinking the objective function could have constraints on the expected number of times the AI breaks the law, or the probability that it breaks the law, e.g.
- only actions with a probability of breaking any law < 0.0001 are permissible, or
- only actions for which the expected number of broken laws is < 0.001 are permissible.
There could also be separate constraints for individual laws or groups of laws, and these could depend on the severity of the penalties.
Looser constraints like these seem like they could avoid issues of lexicality, i.e. prioritizing avoidance of breaking the law over everything we actually want the AI to do. Under a strict "never break any law" rule, the surest way to comply would be to never do anything (although we could also have a separate constraint against that).
Of course, the constraints should depend on actually breaking the law, not just on being caught, so the AI should predict whether it will break the law, not merely whether it will be caught doing so.
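To make the two example constraints concrete, here is a minimal sketch of picking the highest-value action among those that pass a threshold. Everything in it is an illustrative assumption rather than a proposal for a real system: the `Action` class, the `utility` numbers, the per-law probabilities, and the independence assumption used to combine them into a probability of breaking any law. Only the two thresholds come from the examples above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Thresholds from the two example constraints above.
MAX_P_ANY_VIOLATION = 1e-4      # probability of breaking any law
MAX_EXPECTED_VIOLATIONS = 1e-3  # expected number of broken laws


@dataclass
class Action:
    name: str
    utility: float             # hypothetical value of the action to us
    p_break: Dict[str, float]  # hypothetical per-law violation probabilities


def p_any_violation(action: Action) -> float:
    # Assumes violations of different laws are independent
    # (an illustrative simplification, not something the proposal requires).
    p_none = 1.0
    for p in action.p_break.values():
        p_none *= 1.0 - p
    return 1.0 - p_none


def expected_violations(action: Action) -> float:
    # Linearity of expectation: no independence assumption needed here.
    return sum(action.p_break.values())


def permissible(action: Action, use_expected: bool = False) -> bool:
    if use_expected:
        return expected_violations(action) < MAX_EXPECTED_VIOLATIONS
    return p_any_violation(action) < MAX_P_ANY_VIOLATION


def choose(actions: List[Action], use_expected: bool = False) -> Optional[Action]:
    # Maximize utility over permissible actions only; None if nothing passes.
    allowed = [a for a in actions if permissible(a, use_expected)]
    return max(allowed, key=lambda a: a.utility) if allowed else None


actions = [
    Action("do_nothing", 0.0, {}),
    Action("careful", 5.0, {"trespass": 2e-5, "fraud": 3e-5}),
    Action("borderline", 8.0, {"trespass": 4e-4, "fraud": 2e-4}),
    Action("reckless", 10.0, {"fraud": 0.05}),
]

print(choose(actions).name)                     # probability-based constraint
print(choose(actions, use_expected=True).name)  # expectation-based constraint
```

With these made-up numbers, the two constraints permit different sets of actions, and in neither case does the do-nothing action win: it is always permissible, but it is only chosen when nothing more useful passes the threshold.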
The AI could also predict whether or not it will break laws that don't exist now but will in the future (possibly even in response to its actions).
What are the challenges and problems with such an approach? Would it be too difficult to capture such constraints? Are laws too imprecise or ambiguous for this? Can we just have the AI consider multiple interpretations of the laws or try to predict how a human (or human judge) would interpret the law and apply it to its actions given the information the AI has?
How much work should the AI spend on estimating the probabilities that it will break laws?
What kinds of cases would it miss, say, given current laws?
Certainly you still need legal accountability -- why wouldn't we have that? If we solve alignment, then we can just have the AI's owner be accountable for any law-breaking actions the AI takes.
Imagine trying to make teenagers law-abiding. You could have two strategies:
1. Rewire the neurons or learning algorithm in their brain such that you can say "the computation done to produce the output of neuron X reliably tracks whether a law has been violated, and because of its connection via neuron Y to neuron Z, if an action is predicted to violate a law, the teenager won't take it".
2. Explain to them what the laws are (relying on their existing ability to understand English, albeit fuzzily), and give them incentives to follow them.
I feel much better about 2 than 1.
When you say "programming AI to follow law" I imagine case 1 above (but for AI systems instead of humans). Certainly the OP seemed to be arguing for this case. This is the thing I think is extremely difficult.
I am much happier about AI systems learning about the law via case 2 above, which would enable the AI police applications I mentioned above.
I suspect they are thinking about case 2 above? Or they might be thinking of self-driving car type applications where you have an in-code representation of the world? Idk, I feel confident enough in my view that I'd predict there is a miscommunication somewhere, rather than an actual strong difference of opinion between me and them.