To better understand how AI might succeed and fail at learning knowledge, I'll be trying to construct models of limited agents (with bias, knowledge, and preferences) that display identical behaviour in a wide range of circumstances (but not all). This means their preferences cannot be deduced merely/easily from observations.

Does anyone have any suggestions for possible agent models to use in this project?

Would you consider computer viruses to be limited agents that try to appear (superficially) as identical to the unaltered system as possible?

Also note that the actual change between the original system and the altered system can be arbitrarily small, even though the change in behavior can be extremely large. Consider, for example, the Ken Thompson hack or the recent single-gate security attack.

Not looking for exactly this, but somewhat related.

I guess what you are missing is the agentyness or intelligence. But consider that Android already comes with 'assistants' that make recommendations and that may soon cooperate with other such agents to arrange appointments, flights and such.

Farmers nurse small pigs like their own children, but later kill and eat them. From the pigs' point of view, this may be unpredictable.

A spy who works like an ordinary person, but sometimes steals information.

I think you should distinguish whether or not the different behaviours come from different circumstances.
If their environment is always the same, then I think the only way to get what you ask for is if the system has a hidden, very specific parameter that says "when X and Y and Z happen, zig instead of zagging".
Otherwise, if the model is slightly chaotic, then an important alteration to the environment might provoke very different behaviour.
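
A minimal sketch of the hidden-parameter case, assuming observations are just three booleans (X, Y, Z). The names `plain_agent` and `triggered_agent` are made up for illustration; the point is only that the two agents agree on every observation except the one where the trigger fires.

```python
from itertools import product

def plain_agent(obs):
    """Hypothetical baseline agent: always zags, whatever it observes."""
    return "zag"

def triggered_agent(obs):
    """Hypothetical agent with a hidden trigger: zigs only when X, Y and Z all hold."""
    x, y, z = obs
    return "zig" if (x and y and z) else "zag"

# Enumerate all eight possible observations and record where the agents disagree.
all_obs = list(product([False, True], repeat=3))
disagreements = [o for o in all_obs if plain_agent(o) != triggered_agent(o)]
print(disagreements)  # [(True, True, True)]
```

An observer who never sees X, Y and Z together has no behavioural evidence to tell the two apart.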

For the first type of agent, think of two almost identical Markov chains, where only one has a very improbable arc to a stable subnet that doesn't exist in the other chain.
For the second type, think of two similar strange attractors that have different behaviours away from the stable parameters. They will be approximately identical in the same zone and be very different away from that zone.
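
The Markov chain picture can be sketched roughly as follows. Everything here is an assumption chosen for illustration: three states, state 2 playing the role of the stable subnet (an absorbing state), and an arc probability `EPS` of 0.001. In `P1` the arc does not exist; in `P2` it does.

```python
EPS = 1e-3  # assumed probability of the rare arc into the absorbing state

# Chain 1: states 0 and 1 mix freely; state 2 is unreachable from them.
P1 = [
    [0.5, 0.5, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]

# Chain 2: identical except for a rare arc from state 0 into state 2.
P2 = [
    [0.5 - EPS, 0.5, EPS],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]

def step(dist, P):
    """Advance a probability distribution over states by one transition."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def trap_mass(P, steps, trap=2):
    """Probability of being in the trap state after `steps` steps, starting in state 0."""
    dist = [1.0, 0.0, 0.0]
    for _ in range(steps):
        dist = step(dist, P)
    return dist[trap]

for n in (10, 1000, 20000):
    print(n, trap_mass(P1, n), trap_mass(P2, n))
```

Over short runs the two chains are nearly indistinguishable, since the second chain has almost no probability mass in the trap; over long enough runs the second chain ends up there almost surely, while the first never does.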

Agents based on lookup tables.
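
A rough sketch of what that could look like, with made-up observation and action names: two tables that agree on every entry except one, and a helper (`consistent_tables`, hypothetical) that checks which candidate tables reproduce an observed history.

```python
# Two hypothetical lookup-table agents that differ in a single entry.
table_a = {"see_food": "eat", "see_wall": "turn", "see_rival": "flee"}
table_b = {**table_a, "see_rival": "fight"}

def act(table, obs):
    """Look up the action the agent takes for a given observation."""
    return table[obs]

def consistent_tables(history, candidates):
    """Return names of candidate tables that reproduce every (obs, action) pair in history."""
    return [
        name
        for name, table in candidates.items()
        if all(act(table, obs) == action for obs, action in history)
    ]

candidates = {"a": table_a, "b": table_b}

# A history that never includes a rival cannot separate the two agents.
history = [("see_food", "eat"), ("see_wall", "turn")]
print(consistent_tables(history, candidates))  # ['a', 'b']

# One observation of the differing entry is enough to tell them apart.
print(consistent_tables(history + [("see_rival", "fight")], candidates))  # ['b']
```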