Will_Pearson

Comments

I'm thinking about secret projects that might be info-hazardous to each other but still need information from each other, so the connections are by necessity tenuous and transitory. Is that a topic that has been explored before?

Has anyone explored using neural clusters found by mechanistic interpretability as part of a goal system? 

You would look for clusters corresponding to certain concepts, e.g. happiness or autonomy, and wire those clusters into the goal system. If the system learned over time, it could refine those concepts.

This was inspired by how human goals seem to contain concepts that themselves change over time.
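A minimal sketch of what this might look like, assuming your interpretability pipeline can hand you a direction in activation space for each concept (the concept vectors and weights below are hypothetical placeholders, not a real API):

```python
import torch

# Hypothetical sketch: treat a concept direction found by interpretability
# (e.g. a linear probe for "happiness" or "autonomy") as one term of the
# goal system. The plumbing that extracts activations is assumed, not shown.

def concept_score(activations: torch.Tensor, concept_vector: torch.Tensor) -> torch.Tensor:
    """Project hidden activations onto a learned concept direction."""
    return activations @ concept_vector / concept_vector.norm()

def goal_signal(activations: torch.Tensor,
                concept_vectors: dict[str, torch.Tensor],
                weights: dict[str, float]) -> torch.Tensor:
    """Goal system as a weighted sum of interpretability-derived concepts."""
    return sum(w * concept_score(activations, concept_vectors[name])
               for name, w in weights.items())

# Refining the concept over time would amount to periodically re-fitting
# the probes as the model's representations drift, then swapping the
# updated concept vectors into the goal system.
```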

I was reading Multi-Agent Risks from Advanced AI and thinking about the section on emergent capabilities. It seems that multi-agent capabilities might be greater than the sum of their parts, especially where simulation and agency are involved.

Perhaps an agent might be placed in a simulation where illegal actions are the most relevant ones, and told that it is being used to red-team the scenario. This could be used to elicit plans for bad actions and potentially get around filters that catch explicit prompts for bad actions.

How to filter out these simulation-like prompts, so that they can only be used by legitimate entities doing actual red-teaming, might be an important question.
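A crude sketch of a filter along these lines, using keyword heuristics as a stand-in for what would realistically need to be a trained classifier:

```python
import re

# Keyword heuristics standing in for a trained classifier: flag prompts
# that combine a simulation/role-play framing with harm indicators.
SIMULATION_CUES = [r"\bsimulation\b", r"\bred[- ]team\b", r"\bpretend\b",
                   r"\broleplay\b", r"\bhypothetical scenario\b"]
HARM_CUES = [r"\billegal\b", r"\bweapon\b", r"\bexploit\b", r"\bbypass\b"]

def flag_for_review(prompt: str) -> bool:
    """Flag simulation-framed prompts that also touch harm indicators."""
    text = prompt.lower()
    has_sim = any(re.search(p, text) for p in SIMULATION_CUES)
    has_harm = any(re.search(p, text) for p in HARM_CUES)
    return has_sim and has_harm

# Flagged prompts would then be gated on some proof of legitimacy
# (e.g. a verified red-team account) rather than refused outright.
```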

An interesting thing to do with this kind of approach is to apply it to humans (or partially human systems) and see if it still makes sense and produces a sensible society. Try to universalise the concept, because humans or partially human systems can pose great threats too, e.g. memetic ones like Marxism. If it doesn't make sense, or leads to problems like who controls the controllers (itself a potentially destabilising idea if applied at large scale), then it might need to be rethought.

True, I was thinking there would be gates to participation in the network that would indicate the skill or knowledge level of participants without revealing other things about their existence. So if you put gates/puzzles in the way of participation, such that only people capable of rewarding you (if they chose to cooperate) could pass them, that would dangle a possible reward in front of you.
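A toy version of such a gate, using a hashcash-style proof-of-work puzzle. This only proves the participant can spend compute; a real skill or knowledge gate would swap in domain puzzles whose solutions are cheap to verify:

```python
import hashlib
import itertools

# Hashcash-style gate: solving proves a participant can spend compute,
# without revealing anything else about who or what they are.

def solve(challenge: bytes, difficulty_bits: int = 20) -> int:
    """Find a nonce so sha256(challenge || nonce) has enough leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: bytes, nonce: int, difficulty_bits: int = 20) -> bool:
    """Cheap check that the gate was actually passed."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))
```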

Has anyone been thinking about how to build trust and communicate in a dark forest scenario by making plausibly deniable broadcasts, and plausibly deniable reflections of those broadcasts? So you don't actually know who or how many people you might be talking to.
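One toy construction of a plausibly deniable broadcast, assuming key distribution has happened out of band: tag messages with a truncated HMAC, so that only key holders can distinguish a real broadcast from cover noise. For full deniability the payload itself would also need to be encrypted or innocuous-looking; that part is not shown.

```python
import hmac
import hashlib
import secrets

# Tag broadcasts with a truncated HMAC: only holders of the shared key
# can tell a real broadcast from random cover traffic. Key distribution
# and the broadcast channel itself are assumed to exist out of band.

TAG_LEN = 16

def make_broadcast(key: bytes, payload: bytes) -> bytes:
    tag = hmac.new(key, payload, hashlib.sha256).digest()[:TAG_LEN]
    return payload + tag

def looks_meaningful(key: bytes, blob: bytes) -> bool:
    """A key holder can check the tag; to anyone else the blob is just bytes."""
    payload, tag = blob[:-TAG_LEN], blob[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()[:TAG_LEN]
    return hmac.compare_digest(tag, expected)

def cover_traffic(n: int) -> bytes:
    """Noise that non-key-holders cannot distinguish from real broadcasts."""
    return secrets.token_bytes(n)
```

A "reflection" would just be a reply constructed the same way, so an outside observer can't tell responses from cover traffic either, and even a key holder never learns how many other listeners hold the key.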

Sorry, formatting got stripped and I didn't notice.
