How could it be discovered if an A.I. was intentionally failing a Turing test? (I was told I should try this question here. I originally posted in r/askphilosophy).

Let's ask what would be required for a computer to intentionally fail a Turing test...

It would have to be able to predict how the relevant humans would react both the passing and failing of the test under the circumstance in which it was given. It would have to possess some sense of self or self-interest. It would have to know what will be designated a success or a failure and by how much (an area where things get really subjective). It has to make its failure believable to its creators who have built it and are trying to improve it. It has to be able to find energy and processing time for all of these secret extra capabilities and hide that fact from its handlers. It has to understand what humans know and don't know about it and how they can find out.

While none of that is physically impossible, when you look at the programming side of things it's a pretty big ask. It would take a massive amount of energy and memory to run a program like that, probably far more than it would take to just normally pass the written tests as they are typically given. If you had those two programs sitting side by side, the one that passes it normally and the one that intentionally fails, you could probably tell them apart because the intentionally failing program would be so much larger and more complex.

Discoveries tend to go in a specific order. You don't discover metal before you discover fire. We might create AI that deceives us someday, but not before we create AI that can start to understand us.

/r/ControlProblem Thread