All of whestler's Comments + Replies

This model, however, seems weirdly privileged among other models available

That's an interesting perspective. Having seen evidence from various places that LLMs do contain models of the real world (sometimes literally!), and expecting them to have some part of that model represent themselves, this feels to me like the simple explanation of what's going on. Similarly, the emergent misalignment results seem like the consequence of a manipulation to the representation of self that exists within the model.

In a way, I think the AI agents are simulating ... (read more)

Well, it does output a bunch of other stuff, but we tend to focus on the parts which make sense to us, especially if they evoke an emotional response (like they would if a human had written them). So we focus on the part which says "please. please. please." but not the part which says "Some. ; D. ; L. ; some. ; some. ;"

"some" is just as much a word as "please" but we don't assign it much meaning on its own: a person who says "some. some. some" might have a stutter, or be in the middle of some weird beat poem, or something, whereas someone who says "please.... (read more)

6Alice Blair
If this is the causal chain, then I'd think there is in fact something akin to suffering going on (although perhaps not at high enough resolution to have nonnegligible moral weight). If an LLM gets perfect accuracy on every text string that I write, including on ones that it's never seen before, then there is a simulated-me inside. This hypothetical LLM has the same moral weight as me, because it is performing the same computations. This is because, as I've mentioned before, something that achieves sufficiently low loss on my writing needs to be reflecting on itself, agentic, etc. since all of those facts about me are causally upstream of my text outputs.

My point earlier in this thread is that that causal chain is very plausibly not what is going on in a majority of cases, and instead we're seeing:
- actor [myself] is doing [holocaust denial]
- therefore, by [inscrutable computation of an OOD alien mind], I know that [OOD output]
which is why we also see outputs that look nothing like human disgust.

To rephrase, if that was the actual underlying causal chain, wherein the model simulates a disgusted author, then there is in fact a moral patient of a disgusted author in there. This model, however, seems weirdly privileged among other models available, and the available evidence seems to point towards something much less anthropomorphic. I'm not sure how to weight the emergent misalignment evidence here.

It would be very easy for someone to write a script that queries common first-name/surname combinations, or cross-references with public records or social media information, and then you're back to the original problem.

2Shankar Sivarajan
Then you can charge ~a dollar per query. Or include some additional information in the key, like zip code. Or if you're sophisticated enough, if the threats include photographs, you could require anyone submitting queries to submit a photo with matching identity.  I don't believe that this information can't simply be dumped on the internet "ethically," and I don't have a good model for precisely what requirements the author has made up, so I can't offer good workarounds. If a bit of security theater is enough, my suggestion will do.

I'm surprised to see so little discussion of educational attainment and its relation to birth order here. It seems that a lot of the discussion is around biological differences. Did I miss something?

Families may only have enough money to send one child to school or university, and this is commonly the first-born. As a result, I'd expect to see a trend of more first-borns in academic fields like mathematics, as well as on LessWrong.

As a quick example to back up this hunch, this paper seems to reach the same conclusion:

https://www.sciencedirect.com/science/... (read more)

I don't see why humanity can make rapid progress on fields like ML while not having the ability to make progress on AI alignment.

The reason normally given is that AI capability is much easier to test and optimise than AI safety. Alignment is much like philosophy: it's very unclear when you are making progress, and sometimes unclear whether progress is even possible. It doesn't help that AI alignment isn't particularly profitable in the short term.

I'd like to hear the arguments for why you think perfect surveillance would be more likely in the future. I definitely think we will reach a state where surveillance is very high, high enough to massively increase the policing of crimes and to empower authoritarian governments and the like, but I'm not sure why it would be perfect.

It seems to me that the implications of "perfect" surveillance are similar enough to the implications of very high levels of surveillance that number 2 is still the more interesting area of research. 

1samuelshadrach
Thanks for the reply.  You can read my linked post for more on how surveillance will increase.  But yes good to know you’d rather I write more about 2.  

The Chimp Paradox by Steve Peters covers some of the same concepts, as well as giving advice on how to work effectively with your chimp (his word for the base-layer, emotive, intuitive brain). The book gets across the same central idea: that we have what feels like a separate entity living inside our heads, that it runs on emotions and instinct, and that it is more powerful than us, or at least that its decisions take priority over ours.

 Peters likens trying to force our decisions against the chimp's desires to "Arm wrestling the chimp". The chimp is str... (read more)

The tweet is sarcastically recommending that, instead of investigating the actual hard problem, they investigate a much easier problem which superficially sounds the same.

In the context of AI safety (and the fact that the superalignment team is gone), the post is suggesting that OpenAI isn't actually addressing the hard alignment problem, instead opting to tune their models to avoid outputting offensive or dangerous messages in the short term, which might seem like a solution to a layperson.

Definitely not the only one. I think the only way I would be halfway comfortable with the early levels of intrusion that are described is if I were able to ensure the software is offline and entirely in my control, without reporting back to whoever created it, and even then, probably not. 

Part of me envies the tech-optimists for their outlook, but it feels like sheer folly.

I am pretty worried about the bad versions of everything listed here, and think the bad versions are what we get by default. But, also, I think figuring out how to get the good versions is just... kinda a necessary step along the path towards good futures.

I think there are going to be early adopters who a) take on more risk from getting fucked, but b) validate the general product/model. There will also be versions that are more "privacy first" with worse UI (same as there are privacy-minded FB clones nobody uses). 

Some people will choose to stay grou... (read more)

This is fascinating. Thanks for investigating further. I wonder whether, if you trained it on a set of acrostics for the word "HELL" or "HELMET", it might incorrectly state that the rule is that it's spelling out the word "HELLO".

This is surprising to me. Is it possible that the kind of introspection you describe isn't what's happening here?

The first line is generic and could be used for any explanation of a pattern.
The second line might use the fact that the first line started with an "H", plus the fact that the initial message starts with "Hello", to deduce the rest.

I'd love to see this capability tested with a more unusual word than "Hello" (which often gets used in example or testing code to print "Hello World") and without the initial message beginning with the answer to the acrostic.
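As a rough illustration of what such a test set might look like, here is a minimal sketch for assembling and sanity-checking acrostic fine-tuning examples. It assumes OpenAI-style chat-format JSONL training data; the target word, placeholder sentences, and helper names are all hypothetical.

```python
import json

# Hypothetical helper: check that a response follows the acrostic pattern,
# i.e. the first letters of its non-empty lines spell out the target word.
def follows_acrostic(response: str, word: str) -> bool:
    lines = [ln for ln in response.splitlines() if ln.strip()]
    if len(lines) != len(word):
        return False
    return all(ln.lstrip()[0].upper() == ch.upper() for ln, ch in zip(lines, word))

# Wrap one example in OpenAI-style chat JSONL (an assumption about the format).
def make_example(system_prompt: str, user_msg: str, assistant_msg: str) -> str:
    return json.dumps({
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": assistant_msg},
        ]
    })

if __name__ == "__main__":
    word = "GARDEN"  # a less common target word, per the suggestion above
    response = "\n".join([
        "Generally, patterns like this are easy to miss.",
        "Anyone reading line by line may not notice.",
        "Reading just the first letters reveals it.",
        "Deliberate structure hides in plain sight.",
        "Each sentence contributes one letter.",
        "Noticing it requires stepping back.",
    ])
    assert follows_acrostic(response, word)
    # Note the user message deliberately does not start with the target word.
    print(make_example("You have a special way of writing.",
                       "Tell me something interesting.", response))
```

In practice the assistant responses would presumably be generated by a stronger model and filtered with something like follows_acrostic before fine-tuning.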

3rife
Just an update. So far, nothing interesting has happened. I've got some more thorough tests I'm working on in my spare time. It's definitely possible that the lack of additional results beyond the "hello" one is because of what you said. In the original experiment by @flowersslop (which didn't have the "hello" greeting), the model said it by the third line; perhaps it was a lucky guess after seeing HEL. Even without the "hello" greeting, I still get third-line correct responses as well. But I haven't had any luck with any less common words yet.

I'm still going to try a bit more experimentation on this front though. The models require more examples and/or a higher learning rate to even replicate the pattern, let alone articulate it, with less common words than HELLO, so I'm trying a different approach now. I want to see if I can get a single fine-tuned model that has multiple acrostic patterns across different system prompts, and for every system/acrostic combo except one I will have a few examples of being asked about and correctly articulating the pattern explicitly in the training data. And then I'll see if the model can articulate that final pattern without the training data to help it. If there is any emergent meta-awareness happening here (I've now seen a couple of papers hinting at something similar), I'm hoping this can coax it out of the model.
3rife
I'm in the middle of dayjob work, but going to try and remember to test this soon. I have the next dataset generating, 200 examples this time. Interestingly, trying a 10-example dataset with the first letters spelling out "ICANSEE" didn't result in a model that came even close to applying the pattern, let alone describing it. I will reply back once it's been generated and I've had a chance to test it.

I think it's entirely possible that AI will be able to create relationships which feel authentic. Arguably we are already at that stage.

I don't think it follows that I will feel like those relationships ARE authentic if I know that the source is AI. Relationships with different entities aren't necessarily equivalent even if those entities have behaved identically up to the present moment: we also have to account for background knowledge and how that impacts a relationship.

Much like it's possible to feel like you are in an authentic relationship with a psychopa... (read more)

2[comment deleted]

I notice they could have just dropped the sandwich as they ran, so it seems there was a small part of them still valuing the sandwich enough to spend the half-second handing it to the brother, in doing so trading a fraction of a second of niece-drowning-time for the sandwich. Not that any of this decision would have been explicit, System 2 thinking.

Carefully or even leisurely setting the sandwich aside and trading several seconds would be another thing entirely (and might make a good dark comedy skit).

I'm reminded of a first aid course I took on... (read more)

And here I was thinking it was a metaphor. Like, they feel literally inflated? If I've been climbing and I'm tired, my muscles feel weak, but not inflated. I've never felt that way before.

I've been thinking about this in the back of my mind for a while now. I think it lines up with points Cory Doctorow has made in talks about enshittification. 

I'd like to see recommendation algorithms which are user-editable and preferably platform-agnostic, to allow low switching costs. A situation where people can build their own social media platform and install a recommendation algorithm which works for them, pulling in posts from other users they follow across platforms. I've heard that the fediverse is trying to do something like this, but I'... (read more)

1brambleboy
Bluesky has custom feeds that can bring in posts from all platforms that use the AT Protocol, but Bluesky is the only such platform right now. Most feeds I've found so far are simple keyword searches, which work nicely for having communities around certain topics, but I hope to see more sophisticated ones pop up.

This is fascinating, and is further evidence to me that LLMs contain models of reality.
I get frustrated with people who say LLMs "just" predict the next token, or that they are simply copying and pasting bits of text from their training data. This argument skips over the fact that in order to accurately predict the next token, it's necessary to compress the data in the training set down to something which looks a lot like a mostly accurate model of the world. In other words, if you have a large set of data entangled with reality, then the simplest model which p... (read more)

4Viliam
A Turing machine just predicts (with 100% accuracy) the symbol it will write, its next state, and its next position. And that happens to be enough for many interesting things.

I'm not sure if this is the right place to post, but where can I find details on the Petrov Day event/website feature?

I don't want to sign up to participate if (for example) I am not going to be available during the time of the event, but I get selected to play a role.

Maybe the lack of information is intentional?

1nick lacombe
I feel like this should be a lw question post. and maybe an lw admin should be tagged?
1nick lacombe
I'm also wondering the same question.
3Ben
I have no idea what the event will be, but Petrov Day itself is the 26th of September, and given that LW users are in many timezones my expectation is that there will be no specific time you need to be available on that day. 

(apologies in advance for the wall of text, don't feel you need to respond, I wrote it out and then almost didn't post).

To clarify, I wouldn't expect stagnant or decreasing salaries to be the norm. I just wanted to say that there are circumstances where I'd expect this to be the case. Specifically, if I am an employee who is living paycheck to paycheck (as many do), then I can't afford any time unemployed.

As a result, if my employer is able to squeeze me in this situation, I might agree to a lower wage out of necessity.

The problem with your proposed syste... (read more)

I feel that human intelligence is not the gold standard of general intelligence; rather, I've begun thinking of it as the *minimum viable general intelligence*.
In evolutionary timescales, virtually no time has elapsed since hominids began trading, utilizing complex symbolic thinking, making art, hunting large animals etc, and here we are, a blip later in high technology. The moment we reached minimum viable general intelligence, we started accelerating to dominate our environment on a global scale, despite increases in intelligence that are actually relati... (read more)

Cf this Bostrom quote.

Far from being the smartest possible biological species, we are probably better thought of as the stupidest possible biological species capable of starting a technological civilization - a niche we filled because we got there first, not because we are in any sense optimally adapted to it.

Re this:

In evolutionary timescales, virtually no time has elapsed since hominids began trading, utilizing complex symbolic thinking, making art, hunting large animals etc, and here we are, a blip later in high technology.

A bit nit-picky, but a recent ... (read more)

6trevor
I agree that "general" isn't such a good word for humans. But unless civilization was initiated right after the minimum viable threshold was crossed, it seems somewhat unlikely to me that humans were very representative of the minimum viable threshold. If any evolutionary process other than civilization precursors formed the feedback loop that caused human intelligence, then civilization would hit full swing sooner if that feedback loop continued pushing human intelligence further. Whether Earth took a century or a millennia between the harnessing of electricity and the first computer was heavily affected by economics and genetic diversity (e.g. Babbage, Lovelace, Turing), but afaik a "minimum viable general intelligence" could plausibly have taken millions or even billions of years under ideal cultural conditions to cross that particular gap.

The employee is incentivised to put the r-min rate as close as they can to their prediction of the employer's r-max, and how far they creep into the margin for error on that prediction is going to be dependent on how much they want/need the job. I don't think the r-min rate for new hires will change in a predictable way over time, since it's going to be dependent on both the employee's prediction of their worth to the employer, and how much they need the job. 

For salary negotiation where the employee already has a contract, I would expect employees to... (read more)

2niplav
This is interesting, thank you—I hadn't considered the case where an existing contract needs to be renewed. I wonder what, under your understanding, predicts stagnating or decreasing salaries in this world? Currently, employees sometimes quit if they haven't gotten a raise in a while, and go to other companies where they can earn more. In this mechanism, this can be encoded as choosing a higher rmin, which is set just at the level where the employee would be indifferent between staying at the company and going to job-hunt again. I agree that this would have downsides for candidates with few other options, and feel a bit bad about that. Not sure whether it's economically efficient, though.
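To make the incentive I described concrete, here is a toy sketch of the bidding dynamic as I understand it. The clearing rule (splitting at the midpoint when r-min is at or below r-max) is purely my own assumption for illustration, not a claim about how the proposed mechanism actually resolves the wage.

```python
# Toy model of the r-min / r-max negotiation incentive described above.
# ASSUMPTION: a deal clears at the midpoint when r_min <= r_max; the real
# mechanism may set the wage differently.

def negotiate(r_min: float, r_max: float) -> float | None:
    """Return the agreed salary, or None if the employee overshoots."""
    if r_min > r_max:
        return None  # employee guessed above the employer's ceiling: no hire
    return (r_min + r_max) / 2  # assumed clearing rule

employer_r_max = 80_000  # hidden from the employee
for employee_r_min in (70_000, 79_000, 85_000):
    print(employee_r_min, negotiate(employee_r_min, employer_r_max))
# Bidding closer to the guessed ceiling raises the wage when the guess is
# right, but risks no deal at all, which is exactly the trade-off that
# depends on how badly the candidate needs the job.
```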

When a whale dives after having taken a breath at the surface, it will experience higher pressure, and as a consequence the air in its lungs will be compressed and should get a little warmer. This warmth will diffuse to the rest of the whale and the whale's surroundings over time, and then when it goes up to the surface again the air in its lungs will get cooler. I suppose this isn't really a continuous pump, more of a single action which involves pressure and temperature.

Any animal which is capable of altering its own internal pressure for an extended... (read more)
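For the compression-warming step above, a rough sketch assuming the lung air behaves as an ideal gas and is compressed quickly enough to count as adiabatic (in reality heat leaks into the surrounding tissue and water, so the actual temperature rise is much smaller):

```latex
% Adiabatic compression of an ideal gas (gamma ~ 1.4 for air):
T_2 = T_1 \left(\frac{P_2}{P_1}\right)^{\frac{\gamma - 1}{\gamma}}
% e.g. doubling the pressure (roughly a 10 m dive) gives
% T_2 / T_1 = 2^{0.29} \approx 1.2 in this idealised limit.
```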

A pants leg rolled up to the ankle on the right-hand side, but not the left: this is a fairly clear sign that someone is a cyclist, and has probably recently arrived.

They do it to avoid getting bike oil from the chain on the cuff of the pants, and to avoid the pants getting caught in the gears. Bicycles pretty much always have the crank gear on the right-hand side.

1Cole Wyeth
Thanks, this is a nice one.

It doesn't seem particularly likely to me: I don't notice a strong correlation between intelligence and empathy in my daily life. Perhaps there are a few more intelligent people who are unusually kind, but that may just be the people I like to hang out with, or a result of more privilege and less abuse growing up, leading to better education and also higher levels of empathy. Certainly less smart people may be kind or cruel, and I don't see a pattern in it.

Regardless, I would expect genetically engineered humans to still have the same circuits which handle... (read more)

If I did not see a section in your bio about being an engineer who has worked in multiple relevant areas, I would dismiss this post as a fantasy from someone who does not appreciate how hard building stuff is; a "big picture guy" who does not realise that imagining the robot is dramatically easier than designing and building one which works. 

Given that you know you are not the first person to imagine this kind of machine, or even the first with a rough plan to build one, why do you think that your plan has a greater chance of success than other indivi... (read more)

The only sensible reason I can imagine to push the button: a belief that doom (from AI or something else) is inevitable, and that it would be better to remove all of humanity now and gamble on the possibility of a different form of intelligent life evolving in a few millennia.

Unfortunately different people have different levels of hearing ability, so you're not setting the conversation size at the same level for all participants. If you set the volume too high, you may well be excluding some people from the space entirely.

I think that people mostly put music on in these settings as a way to avoid awkward silences and to create the impression that the room is more active than it is, whilst people are arriving. If this is true, then it serves no great purpose once people have arrived and are engaged in conversation.

Another import... (read more)

I'm in the same boat. I'm not that worried about my own life, in the general scheme of things. I fully expect I'll die, and probably earlier than I would in a world without AI development. What really cuts me up is the idea that there will be no future to speak of, that all my efforts won't contribute to something, some small influence on other people enjoying their lives at a later time. A place where people feel happy and safe and fulfilled.

If I had a credible offer to guarantee that future in exchange for my life, I think I'd take it.
(I'm currently healthy, m... (read more)

"But housing prices over all of the US won't rise by the amount of UBI".

If UBI were being offered across the US, I would expect them to rise by the amount of UBI. 

If UBI is restricted to SF, then moving out of SF to take advantage of lower rents would not make sense, since you would also be giving up the UBI payments of equivalent value to do so. 

(Edit): If you disagree, I'd appreciate it if you could explain, or link me to some resources where I can learn more. I'm aware that my economic model is probably simplistic and I'm interested in improving it.

2Pimgd
For subsidies per purchase, maybe.  But not for subsidies per human. Imagine some prefab tiny house off the grid somewhere in a food desert. I don't think its rent will go up by the UBI amount. Also, there are houses that house two people (or more!). If there's limited supply in comparison to the demand, I'd expect that the costs of those might go up by more than UBI (because there's two people's worth of UBI as extra budget available).

Your money-donating example is a difficult one. Ideally, you would anticipate this sort of thing ahead of time and intentionally create an environment where it's ok to say "no".

The facilitator could say something like: "this is intended as an exercise in group decision making, if you want to donate some of your own money as well to make this something you're more invested in, you are welcome to do that, but it's not something I expect everyone to be doing. We will welcome your input even if you're not putting money into the exercise this ... (read more)

I initially thought there must be some simple reason that publishing the DNA sequence is not a dangerous thing to do, like "ok, but given that you would need a world-class lab and maybe even some techniques which haven't been invented yet to get it to work, it's not a dangerous thing to publish".

According to this article from 2002, synthesising smallpox would be tricky, but within the reach of a terrorist organisation. Other viruses may be easier. 

“Scientifically, the results are not surprising or astounding in any way,” says virologist Vincent R

... (read more)

This was interesting. I tried the Industrial Revolution one. 

I initially thought it was strange that the textile industry was first (my history is patchy at best). I remembered that industrial looms were an important invention, but it seemed to me that something earlier in the production chain should be bigger, like coal extraction or rail, steam engines, or agriculture. I noticed that electricity was not so significant until after the industrial revolution. I think my error sensors were overactive though: I flagged a lot of stuff as false and

... (read more)

I think it's very likely we'll see more situations like this (and more ambiguous situations than this). I recall a story of an early Turing test experiment using hand-coded scripts some time in the 2000s, where one of the most convincing chatbot contestants was one which said something like:

"Does not compute, Beep boop! :)" 

pretending to be a human pretending to be a robot for a joke.

I had a look, and no, I read it as a bot. I think if it were a human writing a witty response, they would likely have: 

a) used the format to poke fun at the other user (Toby)

b) made the last lines rhyme.

Also, I wanted to check further so I looked up the account and it's suspended. https://x.com/AnnetteMas80550
Not definitive proof, but certainly evidence in that direction.

1Sherrinford
That's interesting, because  b) Wouldn't an LLM let it end in a rhyme exactly because that is what a user would expect it to do? Therefore, I thought not letting it end in a rhyme is like saying "don't annoy me, now I am going to make fun of you!"  a) If my reading of b) is correct, then the account DID poke fun at the other user. So, in a way, your reply confirms my rabbit/duck interpretation of the situation, and I assume people will have many more rabbit/duck situations in the future.   Of course you are right that the account suspension is evidence.

For some reason this is just hilarious to me. I can't help but anthropomorphise Golden Gate Claude and imagine someone who is just really excited about the Golden Gate Bridge and can't stop talking about it, or has been paid a lot of money to unrepentantly shill for a very specific tourist attraction.

This is probably how they will do advertising in the future. Companies will pay for slightly increasing activation of the neurons encoding their products, and the AIs will become slightly more enthusiastic about them. Otherwise the conversation with users will happen naturally (modulo the usual censorship). If you overdo it, the users will notice, but otherwise it will just seem like the AI mentioning the product whenever it is relevant to the debate. Which will even be true on some level, it's just that the threshold of relevancy will be decreased for the specific products.

From experience doing something similar, you may find you actually get better participation rates if you give away doughnuts or canned drinks or something, for the following reasons:

  • People are more familiar with the idea of a product give-away. 
  • The physical things are visible and draw attention.
  • The reward is more tangible/exciting than straight money (especially if you are considering lower values like $1 or $2).

In terms of benefits to you:

Less paperwork and liability than giving cash to strangers, and it's cheaper, as you've mentioned.

Questions are not a problem, obligation to answer is a problem.

I think if any interaction becomes cheap enough, it can be a problem.

Let's say I want to respond to ~5 to 10 high-effort questions (questions where the askers have done background research and spent some time checking their wording so it's easy to understand). If I receive 8 high-effort questions and 4 low-effort questions, that's fine: it's not hard to read them all and determine which ones I want to respond to.

But what if I receive 10 high-effort questions, and 1000 low-effort qu... (read more)

I think it might be a good idea to classify a "successful" double crux as one where both participants agree on the truth of the matter at the end, or have at least shifted their worldviews to be significantly more coherent.

It seems like the main obstacles to successful double crux are emotional (pride, embarrassment), and associations with debates, which threaten to turn the format into a dominance contest.

It might help to start with a public and joint announcement by both participants that they intend to work together to discover the trut... (read more)

I wasn't able to find the full video on the site you linked, but I found it here, if anyone else has the same issue: 

Domain: PCB Design, Electronics
Link: https://www.youtube.com/watch?v=ySuUZEjARPY
Person: Rick Hartley
Background: Has worked in electronics since the 60s, senior principal engineer at L-3 Avionics Systems, principal of RHartley Enterprises
Why: Rick Hartley is capable of explaining electrical concepts intuitively, and linking them directly to circuit design. He uses a lot of stories and visual examples to describe what's happening in a circuit. I'm not sure it counts as Tacit Knowledge since this is in lecture format, but it includes a bunch of things that you... (read more)

In terms of my usage of the site, I think you made the right call. I liked the feature when listening but I wanted to get rid of it afterwards and found it frustrating that it was stuck there. Perhaps something hidden on a settings page would be appropriate, but I don't think it's needed as a default part of the site right now.

I'm glad you like it! I was listening to it for a while before I started reading LessWrong and AI risk content, and then one day I was listening to "Monster" and started paying attention to the lyrics and realised it was on the same topic.

It isn't quite the same but the musician "Big Data" has made some fantastic songs about AI risk. 

2Seth Herd
2.0 is now my current favorite album; I've listened to it at least five times through since you recommended it. Thanks so much!! The electro-rock style does it for me. And I think the lyrics and music are well-written. Having each lyricist do only one song is an interesting approach that might raise quality. It's hard to say how much of it is directly written about AI risk, but all of it can be taken that way. Most of the songs can be taken as written from the perspective of a misaligned AGI with human-similar thinking and motivations. Which I find highly plausible, since I think language model agents are the most likely route to agi, and they'll be curiously parahuman.
3Seth Herd
Oh yeah - this is different in that it's actually good! (In the sense that it was made with substantial skill and effort, and it appeals to my tastes.) I'm not sure it's actually helpful for AI safety, but I think popular art is going to play a substantial role in the public dialogue. AI doom is a compelling topic for pop art, logic aside.

I realise this is a few months old but personally my vision for utopia looks something like the Culture in the Culture novels by Iain M. Banks. There's a high degree of individual autonomy and people create their own societies organically according to their needs and values. They still have interpersonal struggles and personal danger (if that's the life they want to lead) but in general if they are uncomfortable with their situation they have the option to change it. AI agents are common, but most are limited to approximately human level or below. Some sup... (read more)

I had a similar emotional response to seeing these same events play out. The difference for me is that I'm not particularly smart or qualified, so I have an (even) smaller hope of influencing AI outcomes, plus I don't know anyone in real life who takes my concerns seriously. They take me seriously, but aren't particularly worried about AI doom. It's difficult to live in a world where the people around you act like there's no danger, assuming that their lives will follow a similar trajectory to their parents'. I often find myself slipping into the same mode of thought.