I'm thinking the objective function could include constraints on the expected number of times the AI breaks the law, or on the probability that it breaks the law, e.g.:
- only actions with a probability of breaking any law < 0.0001 are permissible, or
- only actions for which the expected number of broken laws is < 0.001 are permissible.
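To make these two constraints concrete, here is a minimal Python sketch, purely illustrative: the estimators `prob_breaks_any_law` and `expected_laws_broken` are assumed to exist as inputs, and building them is of course the hard part.

```python
from typing import Callable, Iterable, Optional, TypeVar

Action = TypeVar("Action")

P_MAX = 1e-4  # constraint 1: max acceptable probability of breaking any law
E_MAX = 1e-3  # constraint 2: max acceptable expected number of broken laws

def choose_action(
    candidates: Iterable[Action],
    utility: Callable[[Action], float],
    prob_breaks_any_law: Callable[[Action], float],   # assumed estimator
    expected_laws_broken: Callable[[Action], float],  # assumed estimator
) -> Optional[Action]:
    """Maximize the ordinary objective over the subset of permissible actions."""
    permissible = [
        a for a in candidates
        if prob_breaks_any_law(a) < P_MAX and expected_laws_broken(a) < E_MAX
    ]
    if not permissible:
        return None  # nothing is permissible; fall back to some default / no-op
    return max(permissible, key=utility)
```

Separate constraints for individual laws or groups of laws would just add more predicates to the filter.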
There could also be separate constraints for individual laws or groups of laws, and these could depend on the severity of the penalties.
Looser constraints like these seem like they could avoid issues of lexicality, i.e. prioritizing the avoidance of law-breaking over everything we actually want the AI to do: the surest way to avoid ever breaking the law would be to never do anything at all (although we could also have a separate constraint to rule that out).
Of course, the constraints should depend on actually breaking the law, not just on being caught breaking it, so the AI should predict whether it will break the law, not merely whether it will be caught.
The AI could also predict whether it will break laws that don't exist now but will exist in the future (possibly even laws created in response to its own actions).
What are the challenges and problems with such an approach? Would it be too difficult to formalize such constraints? Are laws too imprecise or ambiguous for this? Could we just have the AI consider multiple interpretations of the laws (sketched after these questions), or try to predict how a human (or a human judge) would interpret the law and apply it to the AI's actions, given the information the AI has?
How much work should the AI spend on estimating the probabilities that it will break laws?
What kinds of cases would it miss, say, given current laws?
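On the "multiple interpretations" question above, one simple way to make it concrete is to marginalize the violation probability over a weighted set of candidate readings of the law. A minimal sketch, with every input assumed (eliciting the interpretations and their credences is the actual problem):

```python
from typing import Callable, Sequence, Tuple

def prob_breaks_law(
    action,
    readings: Sequence[Tuple[object, float]],           # (interpretation, credence) pairs; credences sum to 1
    prob_violation: Callable[[object, object], float],  # assumed: P(violation | action, interpretation)
) -> float:
    """Probability that `action` breaks the law, averaged over interpretations."""
    return sum(w * prob_violation(action, interp) for interp, w in readings)
```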
How do you define "biological" and "brain"? Again, your input is a camera image, so you have to build this up starting from sentences of the form "the pixel in the top left corner is this shade of grey".
(Or you can choose some other input, as long as we actually have existing technology that can create that input.)
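To be explicit about what that means: a hand-written specification has to bottom out in a predicate over raw observations. The sketch below only shows the type of the thing you would have to write; the function name and shape assumptions are mine, and the body is exactly the part nobody knows how to fill in directly.

```python
import numpy as np

def contains_biological_brain(frame: np.ndarray) -> float:
    """Return the probability that a biological brain is present in the scene.

    The only primitives available are pixel intensities: frame[i, j] is the
    colour of one pixel (e.g. frame[0, 0] is the top-left pixel). Any notion
    of "biological" or "brain" has to be constructed from statements about
    these numbers.
    """
    assert frame.ndim == 3  # height x width x colour channels
    raise NotImplementedError("no known way to write this by hand from pixels")
```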
Powerful AIs will certainly behave in ways that make it look like they are estimating probabilities.
Let's take AIs trained by deep reinforcement learning as an example. If you want to encode something like "Any particular person dies at least x earlier with probability > p than they would have by inaction" explicitly and literally in code, you will need functions like getAllPeople() and getProbability(event). AIs do not usually come equipped with such functions, so you either have to say how to use the AI system to implement those functions, or you have to implement them yourself. I am claiming that the second option is hard, and any solution you have for the first option will probably also work for something like telling the AI system to "do what the user wants".
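To illustrate, here is roughly what "explicitly and literally in code" would mean for that constraint. The sketch takes getAllPeople and getProbability as parameters precisely because nothing in a deep-RL agent provides them; the event constructor dies_earlier_than_inaction is likewise a placeholder I'm assuming.

```python
from typing import Callable, Iterable

def violates_constraint(
    action,
    getAllPeople: Callable[[], Iterable],               # hypothetical: enumerate every person
    getProbability: Callable[[object], float],          # hypothetical: probability of an arbitrary event
    dies_earlier_than_inaction: Callable[..., object],  # hypothetical: event "person dies >= x earlier than under inaction"
    x: float,
    p: float,
) -> bool:
    """True if some particular person dies at least x earlier with probability > p
    than they would have by inaction, given this action."""
    return any(
        getProbability(dies_earlier_than_inaction(person, action, x)) > p
        for person in getAllPeople()
    )
```

Everything load-bearing sits in the three callables, which is the claim: implementing them yourself is hard, and a method that lets the AI system implement them for you probably also supports "do what the user wants".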
If you're a self-driving car, it's very unclear what an inconsequential default action is. (Though I agree in general there's often some default action that is fine.)
I mean, the existence part was not the main point -- my point was that if butterfly effects are real, then the AI system must always do nothing (even if it can't predict what the butterfly effects would be). If you want to avoid debates about population ethics, you could imagine butterfly effects that affect current people: e.g. you slightly change who talks to whom, which changes whether a person gets hit by a car later in the day or not.
I'm not arguing that these sorts of butterfly effects are real -- I'm not sure -- but it seems bad for the behavior of our AI system to depend so strongly on whether butterfly effects are real.