Either that, or the idea of mind reading agents is flawed.
We shouldn't conclude that, since mindreading agents already exist, to various degrees, in real life.
If we tighten our standard to "games where the mindreading agent is only allowed to predict actions you'd choose in the game, which is played with you already knowing about the mindreading agent", then many decision theories that are different in other situations might all choose to respond to "pick B or I'll kick you in the dick" by picking B.
E(U|NS) = 0.8, E(U|SN) = 0.8
are the best options from a strict U perspective, and exactly tie. Since you've not included mixed actions, the agent must arbitrarily pick one, but arbitrarily picking one seems like favouring an action that is only better because it affects the expected outcome of the war, if I've understood correctly?
I'm pretty sure this is resolved by mixed actions though: the agent can take the policy {NS at 0.5, SN at 0.5}, which also gets U of 0.8 and does not affect the expected outcome of the war, and claim supreme unbiasedness fo...
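That mixture claim is easy to check numerically. A minimal sketch, where the war-outcome shift `d` is a hypothetical placeholder (the original post doesn't give a number):

```python
# A mixed policy's expected utility is the probability-weighted
# average of its component actions' utilities.
actions = {"NS": 0.8, "SN": 0.8}   # E(U|action), from the comment
policy = {"NS": 0.5, "SN": 0.5}    # the proposed mixed policy

expected_u = sum(policy[a] * actions[a] for a in policy)
print(expected_u)  # 0.8: same utility as either pure action

# If NS shifts the war's expected outcome by +d and SN by -d,
# the 50/50 mixture shifts it by 0.5*d + 0.5*(-d) = 0, which is
# the "does not affect the war" property being claimed.
d = 0.3  # hypothetical shift
net_shift = policy["NS"] * d + policy["SN"] * (-d)
print(net_shift)  # 0.0
```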
https://www.yudbot.com/
Theory: LLMs are more resistant to hypnotism-style attacks when pretending to be Eliezer, because failed hypnotism attempts are more plausible and in-distribution, compared to when pretending to be LLMs, where both prompt-injection attacks and actual prompt updates seem like valid things that could happen and succeed.
If so, to make a more prompt-injection-resistant mask, you need a prompt chosen to be maximally resistant to mind control, as chosen from the training data of all English literature, whatever that might be. The kind of en...
In general I just don't create mental models of people anymore, and would recommend that others don't either.
That seems to me prohibitive to navigating social situations or even long term planning. When asked to make a promise, if you can't imagine your future self in the position of having to follow through on it, and whether they do that, how can you know you're promising truthfully or dooming yourself to a plan you'll end up regretting? When asking someone for a favor, do you just not imagine how it'd sound from their perspective, to try and predict if ...
The upside and downside both seem weak when it's currently so easy to bypass the filters. The probability of refusal doesn't seem like the most meaningful measure, but I agree it'd be good for the user to get an explicit flag for whether the trained non-response impulse was activated or not, instead of having to deduce it from the content or wonder if the AI really thinks that its non-answer is correct.
Currently, I wouldn't be using any illegal advice it gives me for the same reason I wouldn't be using any legal advice it gives me: the model just isn't smar...
The regular counterfactual part as I understand it is:
"If I ignore threats, people won't send me threats"
"I am an agent who ignores threats"
"I have observed myself receiving a threat"
You can have at most 2 of these, but FDT needs all 3 to justify ignoring the threat.
It wants to say "If I were someone who responds to threats when I get them, then I'll get threats, so instead I'll be someone who refuses threats when I get threats so I don't get threats" but what you do inside of logically impossible situations isn't well defined.
The logical counterfactual part i...
https://www.lesswrong.com/tag/functional-decision-theory argues for choosing as if you're choosing the policy you'd follow in some situation, before you learnt any of the relevant information. In many games, having a policy of making certain choices (that others could perhaps predict, and adjust their own choices accordingly) gets you better outcomes than just always doing what seems like a good idea at the time. For example, if someone credibly threatens you, you might be better off paying them to go away, but before you got the threat you would've preferred to...
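The policy-vs-act gap can be sketched with a toy extortion game. All payoffs here are made-up numbers, not from the linked post; the key assumption is an extortionist who only threatens agents they predict will pay:

```python
# Toy extortion game: the policy "refuse" looks worse at the moment
# of decision, but the agent whose policy is "refuse" never gets
# threatened in the first place.
PAY_COST = -100    # cost of paying the extortionist off
HARM_COST = -1000  # cost if the threat is carried out

def threatened(policy):
    # The extortionist only bothers threatening agents they
    # predict will pay up; threatening a refuser gains nothing.
    return policy == "pay"

def outcome(policy):
    if not threatened(policy):
        return 0  # no threat ever arrives
    return PAY_COST if policy == "pay" else HARM_COST

print(outcome("pay"))     # -100: paying beats -1000 once threatened...
print(outcome("refuse"))  # 0: ...but the refusing policy is never threatened
```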
I totally agree we can be coherently uncertain about logical facts, like whether P=NP. FDT has bigger problems than that.
When writing this I tried actually doing the thing where you predict a distribution, and only 21% of LessWrong users were persuaded they might be imaginary and being imagined by me, which is pretty low accuracy considering they were in fact imaginary and being imagined by me. Insisting that the experience of qualia can't be doubted did come up a few times, but not as aggressively as you're pushing it here. I tried to cover it in the "hig...
If you think you might not have qualia, then by definition you don't have qualia.
What? Just a tiny bit of doubt and your entire subjective conscious experience evaporates completely? I can't see any mechanism that would do that; it seems like you can be real with any set of beliefs, or fictional with any set of beliefs. Something something map-territory distinction?
This just seems like a restatement of the idea that we should act as if we were choosing the output of a computation.
Yes, it is a variant of that idea, with different justifications th...
Actually, I don't assume that, I'm totally ok with believing ghosts don't have qualia. All I need is that they first-order believe they have qualia, because then I can't take my own first-order belief I have qualia as proof I'm not a ghost. I can still be uncertain about my ghostliness because I'm uncertain in the accuracy of my own belief I have qualia, in explicit contradiction of 'cogito ergo sum'. The only reason ghosts possibly having qualia is a problem is that then maybe I have to care about how they feel.
A valid complaint. I know the answer must be something like "coherent utility functions can only consist of preferences about reality", because if you are motivated by unreal rewards you'll only ever get unreal rewards, but that argument needs to be convincing to the ghost too, who's got more confidence in her own reality. I know that e.g. in Bomb, ghost-theory agents choose the bomb even if they think the predictor will simulate them a painful death, because they consider the small amount of money, at much greater measure for their real selves, to be worth it, but I'm not sure how they get to that position.
They can though? Bomb box 5, incentivise box 1 or 2, bet on box 3 or 4. Since FDT's strategy puts rewarding cooperative hosts above money or grenades, she picks the box that rewards the defector and thus incentivises all 4 to defect from that equilibrium. (I thought over her strategy for quite a bit and there are probably still problems with it but this isn't one of them)
How much of this effect is from morality being causally contagious (associating with Evil people turns you Evil) vs. morality being evidentially contagious (Evil people are more likely to choose to associate with Evil people)?
I'd expect that, all else being equal, organisations secretly run in evil ways will be more willing to secretly accept money from other evil people, for many reasons including that they've got higher expectation of how normal that sort of behaviour is. It seems harder to imagine how a good organisation choosing to take dirty money wo...
If the hosts move first logically, then TDT will lead to the same outcomes as CDT, since it's in each host's interest to precommit to incentivising the human to pick their own box.
It's in the hosts' interests to do that if they think the player is CDT, but it's not in their interests to commit to doing that. They don't lose anything by retaining the ability to select a better strategy later, after reading the player's mind.
Not whichever is lighter; one of whichever pair is heavier. Yes, I claim that an EDT agent, upon learning the rules, will follow this plan if they have a way to blind themself but no way to force a commitment. They need to maximise the incentive the hosts have to put money in boxes, but to whatever extent they actually observe the money in boxes, they will then expect to get the best outcome by making the best choice given their information. The only middle ground I could find is pairing the boxes and picking one from the heavier pair. I'd be very happy to learn if I was wrong, and there's a better plan or this doesn't work for some reason.
EDT isn't precommitting to anything here; she makes what she considers the best choice at every step. It's still a valid complaint that it's unfair to give her a blindfold, though. If CDT found out about the rules of the game before the hosts made their predictions, he'd make himself an explosive collar that he can't take off, and that automatically kills him unless he chooses the box with the least money, and get the same outcome as FDT; EDT would do that as well. For the blindfold strategy, EDT only needs to learn the rules before she sees what's in the boxe...
The hosts aren't competing with the human, only each other, so even if the hosts move first logically they have no reason or opportunity to try to dissuade the player from whatever they'd do otherwise. FDT is underdefined in zero-sum symmetrical strategy games against psychological twins, since it can foresee a draw no matter what, but choosing optimal strategy to get to the draw still seems better than playing dumb strategies on purpose and then still drawing anyway.
Why do you think they should be $100 and $200? Maybe you could try simulating it?
What happ...
Yes, that's the intended point, and probably a better way of phrasing it. I am concluding against the initial assertion, and claiming that it does make sense to trust people in some situations even though you're implementing a strategy that isn't completely immune to exploitation.
I don't consider the randomized response technique lying, it's mutually understood that their answer means "either I support X or both coins came up heads" or "either I support Y or both coins came up tails". There's no deception because you're not forming a false belief and you both know the precise meaning of what is communicated.
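For concreteness, the two-coin scheme as I read it (both heads → report X regardless, both tails → report Y regardless, otherwise answer honestly) can be simulated, including the recovery of the true proportion; the coin convention and the numbers are my own illustration:

```python
import random

def randomized_response(supports_x):
    # Flip two fair coins. Both heads -> report "X" regardless;
    # both tails -> report "Y" regardless; otherwise answer honestly.
    c1, c2 = random.random() < 0.5, random.random() < 0.5
    if c1 and c2:
        return "X"
    if not c1 and not c2:
        return "Y"
    return "X" if supports_x else "Y"

def estimate_support(reports):
    # P(report X) = 1/4 + (1/2) * p, so p = 2 * (freq_X - 1/4):
    # the aggregate is recoverable even though no individual answer
    # pins down that individual's view.
    freq_x = reports.count("X") / len(reports)
    return 2 * (freq_x - 0.25)

random.seed(0)
true_p = 0.7  # hypothetical true fraction supporting X
population = [random.random() < true_p for _ in range(100_000)]
reports = [randomized_response(s) for s in population]
print(round(estimate_support(reports), 2))  # close to 0.7
```

This is why it isn't deception: both parties know the exact disjunction each answer asserts, and that disjunction is true.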
I don't consider penetration testing lying, you know that penetration testers exist and have hired them. It's a permitted part of the cooperative system, in a way that actual scam artists aren't.
What's a word that means "antisoc...
Not sure what's unclear here? I mean that you'd generally prefer not to have incentive structures where you need true information from other people and they can benefit at your loss by giving you false information. Paying someone to lie to you means creating an incentive for them to actually deceive you, not merely giving them money to speak falsehoods.
A sufficiently strong world model can answer the question "What would a very smart very good person think about X?" and then you can just pipe that to the decision output, but that won't get you higher intelligence than what was present in the training material.
Shouldn't human goals have to be within the human-intelligence part, since humans have them? Or are we considering exactly-human-intelligence AI unsafe? Do you imagine a slightly dumber version of yourself failing to actualise your goals from not having good strategies, or failing to even embed t...
You're far more likely to be a background character than the protagonist in any given story, so a theory claiming you're the most important person in a universe with an enormous number of people has an enormous rareness penalty to overcome before you should believe it instead of that you're just insane or being lied to. Being in a utilitarian high-leverage position for the lives of billions can be overcome by reasonable evidence, but for the lives of 3^^^^3 people the rareness penalty is basically impossible to overcome. Even if the story is true, most of ...
You can dodge it by having a bounded utility function, or, if you're utilitarian and good, a function that is at most linear in anthropic experience.
If the mugger says "give me your wallet and I'll cause you 3^^^^3 units of personal happiness" you can argue that's impossible because your personal happiness doesn't go that high.
If the mugger says "give me your wallet and I'll cause 1 unit of happiness to 3^^^^3 people who you altruistically care about" you can say that, in the possible world where he's telling the truth, there are 3^^^^3 + 1 people only one of...
don't make any trade that looks dumb
Ah, well, there you go then.
Hello, I read much of the background material over the past few years but am a new account. Not entirely sure what linked me here first, but my 70% guess is HPMoR. CompSci/mathematics background. I mostly joined due to having ideas slightly too big for my own brain to check and wanting feedback; wanting to infect the community that I get a lot of my memes from with my own original memes; having now read enough to feel like LessWrong is occasionally incorrect about things where I can help; and to improve my writing quality in ways that generalise to explaining things to other people.
If you instead say "evidence of", this makes more sense
Accepted and changed, I'm only claiming some information/entanglement, not absolute proof.
It applies to all transactions (because all transactions are fundamentally about trust)
Would it be clearer to say "markets with perfect information"? The problem I'm trying to describe can only occur with incomplete information / imperfect trust, but doesn't require so little information and trust that transactions become impossible in general. There's a wide middle ground of imperfect trust where all of real life ...
Weighted by credence means you're scored on Probability×Prediction, which isn't a proper scoring rule.
If I sincerely believe it's 60:40 between two options, and I write that down, I expect 0.6×60 + 0.4×40 = 52 payout, but if I write 100:0 I expect 0.6×100 + 0.4×0 = 60 payout; putting more credence on higher-probability outcomes gets me more money, even in excess of my true beliefs.
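A sketch of that arithmetic, plus a contrast with a quadratic (Brier-style) rule, under which the honest report does maximise expected score; the Brier comparison is my addition, not from the thread:

```python
def expected_payout(belief, report):
    # Linear rule: you are paid your reported credence for whichever
    # outcome actually occurs, so expectation is belief-weighted.
    return sum(p * r for p, r in zip(belief, report))

belief = [0.6, 0.4]                         # sincere 60:40 belief
print(expected_payout(belief, [60, 40]))    # 0.6*60 + 0.4*40 = 52
print(expected_payout(belief, [100, 0]))    # 0.6*100 + 0.4*0 = 60: overclaiming wins

def brier_payout(belief, report):
    # Expected negative Brier score: squared distance between the
    # reported distribution and the realised one-hot outcome.
    total = 0.0
    for i, p in enumerate(belief):
        sq_err = sum((report[j] - (1.0 if j == i else 0.0)) ** 2
                     for j in range(len(report)))
        total += p * -sq_err
    return total

# Under the proper rule, honesty beats going all-in on the favourite.
print(brier_payout(belief, [0.6, 0.4]) > brier_payout(belief, [1.0, 0.0]))  # True
```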
Valid; I'm still working on properly writing up the version with the full math, which is much more complicated. Without that math, and without payment, it consists of people stating their beliefs and being mysteriously believed about them, because everyone knows everyone's incentivised to be honest and sincere, and the Agreement Theorem says that means they'll agree once they all know everyone else's reasoning.
Possible Example that I think is the minimum case for any kind of market information system like this:
weather.com wants accurate predictions 7 days in advance...
To entertain that possibility, suppose you're X% confident that your best "fool the predictor into thinking I'll one-box, and then two-box" plan will work, and Y% confident that "actually do one-box, in a way the predictor can predict" plan will work. If X=Y or X>Y you've got no incentive to actually one-box, only to try to pretend you will, but above some threshold of belief the predictor might beat your deception it makes sense to actually be honest.
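A sketch of that threshold, assuming the standard Newcomb payoffs ($1M in the opaque box, $1k in the transparent one; the specific numbers are my assumption), with X and Y as defined above:

```python
M, K = 1_000_000, 1_000  # assumed Newcomb payoffs: opaque box, transparent box

def ev_deceive(x):
    # With prob x the predictor is fooled (opaque box filled) and you
    # two-box for M + K; otherwise you two-box for just K.
    return x * (M + K) + (1 - x) * K

def ev_honest_onebox(y):
    # With prob y the predictor correctly reads your honest one-boxing
    # and fills the opaque box; you take only it.
    return y * M

# If X >= Y, deception dominates, so there's no incentive to be honest...
print(ev_deceive(0.5) > ev_honest_onebox(0.5))  # True
# ...but once Y exceeds X by enough, actually one-boxing wins.
print(ev_honest_onebox(0.9) > ev_deceive(0.5))  # True
```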