I'm thinking the objective function could have constraints on the expected number of times the AI breaks the law, or the probability that it breaks the law, e.g.
- only actions with a probability of breaking any law < 0.0001 are permissible, or
- only actions for which the expected number of broken laws is < 0.001 are permissible.
There could also be separate constraints for individual laws or groups of laws, and these could depend on the severity of the penalties.
Looser constraints like these seem like they could avoid issues of lexicality, i.e. prioritizing avoidance of law-breaking over everything we actually want the AI to do, since the surest way to completely avoid breaking the law would be to never do anything at all (although we could also add a separate constraint to rule that out).
Of course, the constraints should depend on actually breaking the law, not merely on being caught breaking it, so the AI should predict whether or not it will break the law, not just whether or not it will be caught doing so.
The AI could also predict whether or not it will break laws that don't exist now but will in the future (possibly even in response to its actions).
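To make the proposal concrete, here is a minimal sketch of the constraint scheme described above, written as a filter over candidate actions. Everything here is an illustrative assumption rather than a real system: the `Law` type, the thresholds, and especially the `p_break` estimator, which stands in for exactly the hard prediction problem this post is asking about. The sketch upper-bounds P(any law broken) with the union bound over per-law probabilities.

```python
# A sketch of the chance-constrained action filter proposed above.
# All names and numbers are illustrative assumptions, not a real API.

from dataclasses import dataclass

@dataclass(frozen=True)
class Law:
    name: str
    severity: float  # e.g. scaled by the legal penalty

# Hypothetical global thresholds, matching the numbers in the post:
P_ANY_MAX = 1e-4      # max allowed probability of breaking any law
E_BROKEN_MAX = 1e-3   # max allowed expected number of broken laws

def permissible(action, laws, p_break, per_law_max=None):
    """Return True iff `action` satisfies all the chance constraints.

    p_break(action, law) -> estimated probability of breaking `law`.
    per_law_max: optional dict mapping a law's name to a tighter,
    possibly severity-dependent, per-law threshold.
    """
    probs = {law.name: p_break(action, law) for law in laws}

    # Expected number of broken laws is the sum of the marginal
    # probabilities (by linearity of expectation).
    if sum(probs.values()) >= E_BROKEN_MAX:
        return False

    # Upper-bound P(any law broken) by the union bound; if even the
    # bound is under the threshold, the true probability is too.
    if min(1.0, sum(probs.values())) >= P_ANY_MAX:
        return False

    # Separate, severity-dependent constraints for individual laws.
    if per_law_max:
        for law in laws:
            if probs[law.name] >= per_law_max.get(law.name, 1.0):
                return False
    return True
```

For example, under estimates of 1e-6 per law the action passes, while estimates of 0.01 per law violate both global constraints. Note that the union-bound check is conservative: it can reject actions whose true probability of breaking any law is acceptable, which is arguably the right failure direction for a constraint like this.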
What are the challenges and problems with such an approach? Would it be too difficult to capture such constraints? Are laws too imprecise or ambiguous for this? Can we just have the AI consider multiple interpretations of the laws or try to predict how a human (or human judge) would interpret the law and apply it to its actions given the information the AI has?
How much work should the AI spend on estimating the probabilities that it will break laws?
What kinds of cases would it miss, say, given current laws?
Right, I was trying to factor this part out, because it seemed to me that the hope was "the law is explicit and therefore can be programmed in". But if you want to include all of the interpretative text and examples of real-world application, it starts looking more like "here is a crap ton of data about this law, please understand what this law means and then act in accordance with it", as opposed to directly hardcoding in the law.
Under this interpretation (which may not be what you meant) this becomes a claim that laws have a lot more data that pinpoints what exactly they mean, relative to something like "what humans want", and so an AI system will more easily pinpoint it. I'm somewhat sympathetic to this claim, though I think there is a lot of data about "what humans want" in everyday life that the AI can learn from. But my real reason for not caring too much about this is that in this story we rely on the AI's "intelligence" to "understand" laws, as opposed to "programming it in"; given that we're worried about superintelligent AI it should be "intelligent" enough to "understand" what humans want as well (given that humans seem to be able to do that).
I'm not sure what you're trying to imply with this -- does this make the AI's task easier? Harder? The generality somehow implies that the AI is safer?
Like, I don't get why this point has any bearing on whether it is better to train "lawyerlike AI" or "AI that tries to do what humans want". If anything, I think it pushes in the "do what humans want" direction, since historically it has been very difficult to create generalist AIs, and easier to create specialist AIs.
(Though I'm not sure I think "AI that tries to do what humans want" is less "general" than lawyerlike AI.)