Here's a fun one. Suppose there's a distribution of how morally correct people are.
Imagine for a second that this makes sense, picking whatever definition you like.
Where do you think you are on it?
Maybe somewhere in the right direction. People tend to think they're doing the right thing.
But how do you know if you're wrong? How do you know what the person who's furthest along in the "right direction" sees?
Subjective opinion: I would probably be happy to do a few more past the limit even without the automated explanation score.
If, for instance, only the first 10-20 had an automatic score, and the rest were dependent on manual votes (starting at score 0), I think the task would still be interesting, hopefully without causing too much trouble.
(There's also a realistic chance that I may lose attention and forget about the site in a few weeks, despite liking the game, so (disregarding spam problems) a higher daily limit with reduced functionality may still have value?)
(As a note, the Discord link also seems to be expired or invalid on my end)
The game is addictive for me, so I can't resist an attempt at describing this one, too :)
It seems related to grammar, possibly looking for tokens on/after articles and possessives.
My impression from trying out the game is that most neurons are not too hard to find plausible interpretations for, but most seem to have low-level syntactic (2nd token of a word) or grammatical (conjunctions) concerns.
Assuming that is a sensible thing to ask for, I would definitely be interested in a UI that allows working with the next-smallest meaningful construction that features more than a single neuron.
Some neurons seem to have 2 separate low-level patterns that cannot clearly be tied together. This suggests they may have separate "graph neighbors" that rely on them for 2 separate concerns. I would like some way to follow and separate what neurons are doing together, not just individually, if that makes any sense =)
(As an aside, I'd like to apologize that this isn't directly responding to the residuals idea. I'm not sure I know what residuals are, though the description of what can be done with it seems promising, and I'd like to try the other tool when it comes online!)
Note that the pattern
x = 'x = {!r};print(x.format(x))';print(x.format(x))
is described on the Rosetta Code page for quines. It's possible that the trick is well known and that GPT-4 was able to reach for it.
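(For what it's worth, a quick way to convince yourself the one-liner really is a quine is to run it in a fresh interpreter and compare the output with its own source. A minimal sketch, assuming a standard Python 3 interpreter:)

```python
import subprocess
import sys

# The one-liner quine built on str.format and the {!r} conversion
source = "x = 'x = {!r};print(x.format(x))';print(x.format(x))"

# Run the snippet in a fresh interpreter and capture what it prints
result = subprocess.run([sys.executable, "-c", source],
                        capture_output=True, text=True)

# A quine's output is exactly its own source code
assert result.stdout.rstrip("\n") == source
print("quine verified")
```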
(I don't know that I would call the resulting code copied. If I were given this prompt, the extra requirements would make me interpret it as a "show me you can fulfill these specific requirements" exercise rather than an exercise specifically about finding the trick. So, reaching for a pattern seems the appropriate level of lazy; the trick feels sufficiently non-central to the question that I like the 'applying a pattern' label better than 'copying'.)
As a very new user, I'm not sure if it's still helpful to add a data point if user testing's already been done, but it seems at worst mostly harmless.
I saw the mod note before I started using the votes on this post. My first idea was to Google the feature, but that returned nothing relevant (while writing this post, I did find results immediately through site search). I was confused for a short while trying to place the axes & imagine where I'd vote in opposite directions. But after a little bit of practice looking at comments, it started making sense.
I've read a couple of comments on this article that I agree with, where it nevertheless seems very meaningful for me to downvote them (I interpret the karma downvote's meaning, when both axes are on, as: low quality, low importance, should be read less often).
I find it relatively easy to find posts I want to upvote on karma. But for posts that I upvote, I'm typically much less confident about voting on agreement than for other posts (as a new user, it's harder to assess the specific points made in high-quality posts).
Posts where I'm not confident voting on agreement correlate with posts I'm not confident I can reply to without lowering the level of debate.
Unfortunately, the further the specific points being made are from my comfort/knowledge zone, the less able I am to tell nonsense from sophistication.
It seems bad if my karma-vote density centers on somewhat-good posts to the exclusion of very good and very bad posts. This makes me err on the side of upvoting posts I don't truly understand. I think the system should be robust to that, since new users' votes seem to carry less weight and I expect overrated nonsense to be corrected quickly, but it still seems suboptimal.
It's also unclear to me whether agreement voting factors into the sorting order. I predict it doesn't, and I would want to change how I vote if it did.
Overall, I don't have a good sense of how much value I get out of seeing both axes, but on this post I do like voting with both. It feels a little nicer, though I don't have a strong preference.
About the usual example being "burn all GPUs": I'm curious whether it's to be understood purely as a stand-in for the magnitude of the act, or whether it's meant to plausibly be in the solution space.
An event of "burn all GPUs" magnitude would have political ramifications. If you achieve this as a human organization with human means, i.e. without AGI cooperation, it seems violence on this scale would unite people against you, resulting in a one-time delay.
If the idea is an act outside the Overton Window, without AGI cooperation, shouldn't you aim to have the general public and policymakers united against AGI, instead of against you?
Given that the semiconductor manufacturing capability required to make GPU- or TPU-like chips is highly centralized, with only three to four relevant fabs left, restricting AI hardware access may not be enough to stop bad incentives for large actors indefinitely, but it seems likely to gain more time than a single "burn all GPUs" event.
For instance, killing a {thousand, fifty-thousand, million} people in a freak bio-accident seems easier than solving alignment. If you pushed a weak AI into the trap and framed it for falling into it, would that gain more time through policymaking than destroying GPUs directly (still assuming a pre-AGI world)?
I think I'm trying to say something about the place where the tails don't diverge too far yet, as if there is some sort of rough-consensus morality zone where people might disagree on details but agree sufficiently with each other in disagreeing with me about myself. But maybe I'm confused and that's still incoherent (sorry)!
I wonder if I might have a wrong impression of myself. Maybe people who I see doing ethical things that I don't do, whom I would put above me in my personal version of an ethics distribution, would form a rough consensus where I'm at a worse percentile on average than I'd have guessed.
I'm holding to the idea that it's meaningful to consider what percentile other people would put you at, according to their own metric, and then do statistics over that, but I am neither a statistician nor a clever person, so I'd be happy to be corrected :)
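(A toy sketch of the kind of statistics I have in mind, with the percentile numbers invented purely for illustration:)

```python
import statistics

# Invented numbers: each rater places me at some percentile under their
# own moral metric (0 = worst, 100 = best).
placements = [55, 40, 70, 35, 60]

mean_placement = statistics.mean(placements)   # average placement across raters
spread = statistics.stdev(placements)          # how much the raters disagree

print(f"mean percentile: {mean_placement:.1f}, spread: {spread:.1f}")
# If the mean sits well below where I'd place myself, that's the
# "worse percentile on average than I'd have guessed" case.
```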