Rather than having objective standards, I find a growth-centric approach to be most effective. Optimizing for output is easy to Goodhart, so as much as possible I treat output as a metric rather than a goal. It's important that I'm getting more done now than I was a year ago, for example, but I don't explicitly aim for a particular output on a day-to-day basis. Instead I aim to optimize my processes and improve my skills, which leads to increased output. That applies not just to work performance, but to many other things.
score = (expected output * expected value of work per unit of output) / funding required
For each person who needs funding, compute this score, sort the list in descending order, and allocate funding from top to bottom (see the sketch below). You don't need to fully solve the knapsack problem here, because leftover funding can be carried over.

Agreed, this was an expected result. It's nice to have a functioning example to point to for LLMs in an RLHF context, though.
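To make the funding-allocation scheme above concrete, here's a minimal sketch of the scoring and greedy allocation; the candidates, numbers, and budget are made up for illustration:

```python
# Greedy allocation by score; all figures below are invented placeholders.
candidates = [
    {"name": "A", "expected_output": 10, "value_per_unit": 3.0, "funding_required": 50},
    {"name": "B", "expected_output": 4,  "value_per_unit": 8.0, "funding_required": 20},
    {"name": "C", "expected_output": 7,  "value_per_unit": 2.0, "funding_required": 40},
]

def score(c):
    # score = (expected output * expected value of work per unit of output) / funding required
    return c["expected_output"] * c["value_per_unit"] / c["funding_required"]

budget = 80
for c in sorted(candidates, key=score, reverse=True):
    if c["funding_required"] > budget:
        break  # no knapsack optimization; whatever is left carries over to the next round
    budget -= c["funding_required"]
    print(f"fund {c['name']} (score {score(c):.2f})")
print(f"carried over: {budget}")
```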
From one perspective, nature does kind of incentivize cooperation in the long term. See The Goddess of Everything Else.
Is there a reason to believe this is likely? Outside of a strong optimization pressure for niceness (of which there is some, but it's weak relative to other optimization pressures), I'd expect these organizations to be of roughly average niceness for their situation.
A quick Google search for "probe-tuning" doesn't turn up anything. Do you have more info on it?
Probe-tuning doesn't train on the LLM's own "original rollouts" at all, only on the LLM's activations during the context pass through the LLM.
This sounds like regular fine-tuning to me, unless you mean that the loss is calculated on one (or several?) of the network's activations rather than on the output logits.
Edit: I think I get what you mean now. You want to hook a probe to a model and fine-tune it to perform well as a probe classifier, right?
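If that's the idea, here's a minimal sketch of what I'm imagining (PyTorch/Hugging Face; the base model, layer index, and data are arbitrary placeholders, not anything from your setup):

```python
# Sketch of "probe-tuning" as I understand it: a linear probe reads one of the
# model's hidden activations, and the probe's classification loss is backpropagated
# into the base model as well as the probe. No loss is taken on the LM's own
# output logits or rollouts. Model choice, layer index, and data are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
probe = torch.nn.Linear(model.config.hidden_size, 2)  # binary probe classifier

optimizer = torch.optim.Adam(
    list(model.parameters()) + list(probe.parameters()), lr=1e-5
)

text, label = "an example input", torch.tensor([0])  # placeholder training pair
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)

hidden = outputs.hidden_states[6][:, -1, :]  # activation at an arbitrary layer, last token
logits = probe(hidden)

loss = torch.nn.functional.cross_entropy(logits, label)  # loss on the probe, not the LM head
loss.backward()  # gradients flow through the probe into the base model's weights
optimizer.step()
```

Regular fine-tuning would compute the loss on the LM head's output logits; here the only supervision signal reaches the base model through the probe.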
It's also possible that there is some elegant, abstract "intelligence" concept (analogous to arithmetic) which evolution built into us but we don't understand yet and from which language developed. It just turns out that if you already have language, it's easier to work backwards from there to "intelligence" than to build it from scratch.
This probably isn't the case, but I secretly wonder if the people in camp #1 are p-zombies.
Not very familiar with US culture here: is AI safety not extremely blue-tribe coded right now?
How does the logic here work if you change the question to be about human history?
Guessing a 50/50 coin flip is obviously impossible, but if Omega asks whether you are in the last 50% of "human history," the doomsday argument (not that I subscribe to it) is more compelling. The key point of the doomsday argument is that humanity's growth is exponential: if we're the median birth-rank human and we continue to grow, we don't actually have that long (in wall-clock time) left to live.
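As a rough back-of-the-envelope illustration (all numbers below are my own loose assumptions, not anyone's canonical figures): if the birth rate keeps growing exponentially, the "second half" of all humans gets born in surprisingly little calendar time.

```python
# Back-of-the-envelope: how long until as many humans are born as have ever lived,
# assuming the birth rate keeps growing exponentially? All numbers are rough guesses.
import math

past_births = 100e9        # order-of-magnitude estimate of humans born so far
births_per_year = 140e6    # rough current global birth rate
growth_rate = 0.005        # assumed exponential growth of the birth rate (0.5%/year)

# Solve (births_per_year / growth_rate) * (exp(growth_rate * t) - 1) = past_births for t.
t = math.log(1 + past_births * growth_rate / births_per_year) / growth_rate
print(f"~{t:.0f} years until the second 'half' of humanity has been born")
# With these assumptions it comes out on the order of a few centuries, which is
# the "not that long in wall-clock time" point.
```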
Some people have strong negative priors toward AI in general.
When the GPT-3 API first came out, I built a little chatbot program to show my friends/family. Two people (out of maybe 15) flat out refused to put in a message because they just didn't like the idea of talking to an AI.
I think it's more of an instinctual reaction than something thought through. There's probably a deeper psychological explanation, but I don't want to speculate.