I'm thinking the objective function could have constraints on the expected number of times the AI breaks the law, or the probability that it breaks the law, e.g.
- only actions with a probability of breaking any law < 0.0001 are permissible, or
- only actions for which the expected number of broken laws is < 0.001 are permissible.
There could also be separate constraints for individual laws or groups of laws, and these could depend on the severity of the penalties.
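To make the shape of this concrete, here is a minimal sketch of such chance constraints as a filter on candidate actions. The function names, thresholds, and the idea of precomputed per-action estimates are illustrative assumptions on my part, not a concrete proposal:

```python
def permissible(stats, p_max=1e-4, e_max=1e-3, per_law_limits=None):
    """Check one candidate action's law-breaking estimates against the constraints.

    stats: the AI's own predictions for the action:
      'p_any'           - probability the action breaks any law
      'expected_broken' - expected number of laws broken
      'p_per_law'       - {law_id: probability of breaking that law}
    per_law_limits: optional {law_id: max probability}, tighter for more severe laws.
    """
    if stats["p_any"] >= p_max:
        return False
    if stats["expected_broken"] >= e_max:
        return False
    for law_id, limit in (per_law_limits or {}).items():
        if stats["p_per_law"].get(law_id, 0.0) >= limit:
            return False
    return True


def choose_action(candidates, utility):
    """Maximize the ordinary objective over the permissible actions only.

    candidates: list of (action, stats) pairs; utility: action -> float.
    """
    feasible = [a for a, s in candidates if permissible(s)]
    return max(feasible, key=utility) if feasible else None  # e.g. fall back to a designated null action
```

The point of the structure is just that law-breaking enters as a feasibility condition rather than as a term traded off inside the utility function.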
Looser constraints like these seem like they could avoid issues of lexicality and of prioritizing the avoidance of law-breaking over everything we actually want the AI to do, since the surest way to avoid breaking the law completely is to never do anything at all (although we could also handle that with a separate constraint).
Of course, the constraints should depend on actually breaking the law, not just on being caught breaking it, so the AI should predict whether it will break the law, not merely whether it will be caught doing so.
The AI could also predict whether it will break laws that don't exist now but will exist in the future (possibly even laws passed in response to its actions).
What are the challenges and problems with such an approach? Would it be too difficult to capture such constraints? Are laws too imprecise or ambiguous for this? Can we just have the AI consider multiple interpretations of the laws or try to predict how a human (or human judge) would interpret the law and apply it to its actions given the information the AI has?
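On the last question, one very rough way to picture "consider multiple interpretations" is to marginalize over a predicted distribution of how a human judge would read the law. The weights and conditional probabilities below are assumed to come from the AI's own models, and the numbers are made up for illustration:

```python
def prob_violation(interpretation_weights, p_violation_given_interp):
    """P(action breaks the law), averaged over predicted judicial interpretations.

    interpretation_weights:   {interp_id: P(a judge would adopt this reading)}
    p_violation_given_interp: {interp_id: P(the action violates the law under that reading)}
    """
    return sum(w * p_violation_given_interp[i]
               for i, w in interpretation_weights.items())


# Example: a strict reading and a lenient reading of the same statute.
weights = {"strict": 0.3, "lenient": 0.7}
p_given = {"strict": 0.02, "lenient": 0.0001}
print(prob_violation(weights, p_given))  # ≈ 0.00607
```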
How much work should the AI spend on estimating the probabilities that it will break laws?
What kinds of cases would it miss, say, given current laws?
That's what you want, but "Maximize paperclips" doesn't imply it under any literal interpretation, nor does "Maximize paperclips" imply "maximize paperclips while killing at least one person". What I'm looking for is logical equivalence, and adding qualifiers about whether or not people are killed breaks that equivalence.
I think much more is hidden in "good", which is something people have a problem specifying fully and explicitly. The law is more specific and explicit, although it could be improved significantly.
That's true. I looked at the US Code's definition of manslaughter, and on a literal interpretation it could imply that helping someone procreate is manslaughter, because bringing someone into existence causes their death. That law would have to be rewritten, perhaps along the lines of "Any particular person dies, with probability > p, at least x earlier than they would have by inaction", or something closer to the definition of stochastic dominance for time of death (it could be a disjunction of statements). These are just first attempts, but I think they could be refined enough to capture a prohibition on killing humans to our satisfaction, and the AI wouldn't need to understand vague and underspecified words like "good".
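To write that first attempt a bit more formally (notation mine, and only a restatement of the sketch above): let $T_i(a)$ be person $i$'s time of death if the AI takes action $a$, and $T_i(\varnothing)$ their time of death under inaction. The rewritten prohibition would then flag an action $a$ whenever

$$\exists\, i:\ \Pr\big[T_i(a) \le T_i(\varnothing) - x\big] > p,$$

and the stochastic-dominance-style variant would instead flag $a$ whenever $\Pr[T_i(a) > t] \le \Pr[T_i(\varnothing) > t]$ for all $t$, with strict inequality for some $t$.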
We would then do this one by one for each law, but spend a disproportionate amount of time on the more important laws to get them right.
(Note that laws don't cover nonidentity cases, as far as I know.)