<note> I work in Clarity at OpenAI. Chris and I have discussed this response (though I cannot claim to represent him).
Does "faithful" mean "100% identical in terms of I/O", or more like "captures all of the important elements of"?
I'd say faithfulness lies on a spectrum. Full IO determinism on a neural network is nearly impossible (given the vagaries of floating point arithmetic), but what is really of interest to us is “effectively identical IO”. A working definition of this could be - an interpretable network that one that can act as a drop-in replacement the original network with no impact on final accuracy.
This allows some wiggle room in the weights - to... (read more)
<note> I work in Clarity at OpenAI. Chris and I have discussed this response (though I cannot claim to represent him).
I'd say faithfulness lies on a spectrum. Full IO determinism on a neural network is nearly impossible (given the vagaries of floating point arithmetic), but what is really of interest to us is “effectively identical IO”. A working definition of this could be - an interpretable network that one that can act as a drop-in replacement the original network with no impact on final accuracy.
This allows some wiggle room in the weights - to... (read more)