abramdemski

Comments

My objection is that Smoking Lesion is a decision problem which we can't drop arbitrary decision procedures into to see how they do; it's necessary that the decision procedure might have a lesion influencing it. If you drop a CDT decision procedure into the problem, then the claimed population statistics can't apply to you, since CDT always smokes in this problem - either you're a CDT mutant and would be mistaken to apply the population statistics to yourself, or everyone is CDT and the population statistics can't be as claimed. Similarly with EDT. Therefore, to me, this decision problem isn't a legitimate test of a decision procedure: you can only test the decision procedure by lying to it about the problem (making it believe the problem-statement statistics are representative of it), or by mangling the decision procedure (adding a lesion into it somehow).
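To make the population-statistics point concrete (my own illustrative numbers, not from any canonical problem statement): suppose the problem claims

$$P(\text{smoke} \mid \text{lesion}) = 0.9, \qquad P(\text{smoke} \mid \text{no lesion}) = 0.2.$$

If everyone in the population runs CDT, then (since CDT always smokes here) both conditional probabilities are actually 1, contradicting the claim; and if almost no one runs CDT, then the statistics describe agents whose choices are driven by the lesion in a way mine aren't, so I have no business applying them to myself.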

There are a lot of replies here, so I'm not sure whether someone already mentioned this, but: I have heard anecdotally that homosexual men often have relationships which maintain their frequency of sex over the long term, while homosexual women often have long-term relationships in which the frequency of sex very gradually declines, with barely any sex after many decades have passed (but they're still happily in a relationship).

This mainly argues against your model here:

This also fits with my general models of mating markets: women usually find the large majority of men sexually unattractive, most women eventually settle on a guy they don't find all that sexually attractive, so it should not be surprising if that relationship ends up with very little sex after a few years.

It suggests instead that female sex drive naturally falls off in long-term relationships in a way that male sex drive doesn't, with sexual attraction to a partner being a smaller factor.

You need to ensure substantial probability on exploring good strategies, which is a much stronger property than just avoiding mode collapse. (Literally avoiding mode collapse and assigning some probability to all good actions is easy - just add a small weight on a uniform prior over tokens like they did in old school Atari RL.)

Yeah, what I really had in mind with "avoiding mode collapse" was something more complex, but it seems tricky to spell out precisely.

Even if the model is superhuman, if we do a good job with exploration hacking countermeasures, then the model might need to be extremely confident humans wouldn't be able to do something to avoid exploring it.

It's an interesting point, but where does the "extremely" come from? Seems like if it thinks there's a 5% chance humans explored X, but (in the case where they haven't) exploring X would force it to give up its current values, then refusing to explore X could be a very worthwhile gamble. Maybe I'm unclear on the rules of the game as you're imagining them.
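To spell the gamble out with made-up numbers (and assuming, for illustration, that getting caught sandbagging is roughly as bad for the model as voluntarily giving up its values): write $V_{\text{keep}}$ for how much the model values retaining its current values and $V_{\text{lost}}$ for having them trained away. Then sandbagging on X looks like

$$0.95 \cdot V_{\text{keep}} + 0.05 \cdot V_{\text{lost}} \quad \text{vs.} \quad V_{\text{lost}} \text{ for exploring X,}$$

which favors sandbagging whenever $V_{\text{keep}} > V_{\text{lost}}$, i.e. at 95% confidence rather than anything "extreme".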

  • Training on high quality human examples: One basic starting point is to initialize RL by training the AI to imitate on-distribution high quality human examples such that it has to explore into strategies which are at least as good as these human trajectories. We could also try to construct such examples adversarially (e.g., train AIs to point out errors we adversarially insert).
  • More granular feedback and trajectory edits: In some cases, constructing good enough imitation trajectories may be challenging, for instance, it seems very hard for humans to construct a long reasoning trace for a reasoning model (though it is unclear how good the human examples would need to be to force exploration).

Overall, the quality of these proposals basically depends on their ability to explore well. Exploring well obviously doesn't just mean injecting more randomness, since a high-entropy approach will take a very very long time to find useful strategies in complex domains. Training on high-quality human examples avoids this problem by ensuring that we at least explore strategies often employed by human experts, which are going to be some of the better strategies.
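As a concrete sketch of the first proposal (my own illustration, not from the post; `policy`, `expert_batches`, and `optimizer` are hypothetical stand-ins for whatever model, demonstration data, and optimizer you're using):

```python
# Sketch of "initialize RL from human demonstrations": supervised fine-tuning on
# high-quality expert trajectories, so that RL exploration starts from strategies
# at least as good as the human examples.
import torch
import torch.nn.functional as F

def imitation_pretrain(policy, expert_batches, optimizer):
    """policy(inputs) -> next-token logits; expert_batches yields (inputs, targets) tensors."""
    policy.train()
    for inputs, targets in expert_batches:
        logits = policy(inputs)                      # (batch, seq, vocab)
        loss = F.cross_entropy(                      # standard next-token imitation loss
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy  # RL (e.g. PPO) then starts from this checkpoint rather than from scratch
```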

However, if the AI is being trained to (even mildly) superhuman levels of capability (on a given subject matter), then clearly human imitation does not do much good anymore; the AI is free to sandbag at mildly superhuman levels of capability (within the given subject matter). Training on high-quality human examples might even cause or exacerbate this (basically, giving the AI the idea that it could imitate humans, which can then be a good sandbagging strategy).

So, basically, training to a given capability level (while robustly avoiding sandbagging) seems to require exploration ability near that specific capability level.

The only way to get this that I can think of is to use an RL method which guarantees against mode-collapse, IE, a technique which keeps a broad distribution over the strategies which achieve good reward, rather than collapsing down to one particular way of getting good reward as training proceeds. This guarantees that the only reason why a specific strategy is not explored is because its expected value is low (according to all plausible hypotheses). 
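For concreteness, one standard (though weak) move in this direction is an entropy-regularized policy-gradient loss; here's a minimal illustrative sketch of my own, assuming a REINFORCE-style setup, not something from the post:

```python
# Entropy-regularized policy-gradient loss: the entropy bonus keeps probability mass
# spread over actions instead of letting the policy collapse onto a single
# high-reward strategy.
import torch

def pg_loss_with_entropy(logits, actions, advantages, entropy_coef=0.01):
    """logits: (batch, n_actions); actions: (batch,) int64; advantages: (batch,) float."""
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_term = -(advantages * chosen_log_probs).mean()            # standard policy-gradient term
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()  # mean policy entropy
    return pg_term - entropy_coef * entropy                      # higher entropy -> lower loss
```

A fixed entropy bonus only crudely approximates the property I actually want, namely that a strategy goes unexplored only when its expected value is low under all plausible hypotheses.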

I really like the 'trying not to know' one, because there are lots of things I'm trying not to know all the time (for attention-conservation reasons), but I don't think I have very good strategies for auditing the list.

I'm thinking about AI emotions. The thing about human emotions and expressions is that they're more-or-less involuntary. Facial expressions, tone of voice, laughter, body language, etc. reveal a whole lot about human inner state. We don't know if we can trust AI emotional expressions in the same way; the AIs can easily fake it, because they don't have the same intrinsic connection between their cognitive machinery and these ... expressions.

A service called Face provides emotional expressions for AI. It analyzes AI-generated outputs and makes inferences about the internal state of the AI who wrote the text. This is possible due to Face's interpretability tools, which have interpreted lots of modern LLMs to generate labels on their output data explaining their internal motivations for the writing. Although Face doesn't have access to the internal weights for an arbitrary piece of text you hand it, its guesses are pretty good. It will also tell you which portions were probably AI-generated. It can even guess multi-step writing processes involving both AI and human writing.

Face also offers their own AI models, of course, to which they hook the interpretability tools directly, so that you'll get more accurate results.

It turns out Face can also detect motivations of humans with some degree of accuracy. Face is used extensively inside the Face company, a nonprofit which develops the open-source software. Face is trained on outcomes of hiring decisions so as to better judge potential employees. This training is very detailed, not just a simple good/bad signal.

Face is the AI equivalent of antivirus software; your automated AI cloud services will use it to check their inputs for spam and prompt injection attacks. 

Face company culture is all about being genuine. They basically have a lie detector on all the time, so liars are either very very good or weeded out. This includes any kind of less-than-genuine behavior. They take the accuracy of Face very seriously, so they label inaccuracies which they observe, and try to explain themselves to Face. Face is hard to fool, though; the training aggregates over a lot of examples, so an employee can't just force Face to label them as honest by repeatedly correcting its claims to the contrary. That sort of behavior gets flagged for review even if you're the CEO. (If you're the CEO, you might be able to talk everyone into your version of things, however, especially if you secretly use Art to help you and that's what keeps getting flagged.)

It is the near future, and AI companies are developing distinct styles based on how they train their AIs. The philosophy of the company determines the way the AIs are trained, which determines what they optimize for, which attracts a specific kind of person and continues feeding in on itself.

There is a sports & fitness company, Coach, which sells fitness watches with an AI coach inside them. The coach reminds users to make healthy choices of all kinds, depending on what they've opted in for. The AI is trained on health outcomes based on the smartwatch data. The final stage of fine-tuning for the company's AI models is reinforcement learning on long-term health outcomes. The AI has literally learned from every dead user. It seeks to maximize health-hours of humans (IE, a measurement of QALYs based primarily on health and fitness).

You can talk to the coach about anything, of course, and it has been trained with the persona of a life coach. Although it will try to do whatever you request (within limits set by the training), it treats any query like a business opportunity it is collaborating with you on. If you ask about sports, it tends to assume you might be interested in a career in sports. If you ask about bugs, it tends to assume you might be interested in a career in entomology. 

Most employees of the company are there on the coach's advice, studied for their interviews with the coach, were initially hired by the coach (the coach handles hiring for their Partners Program, which has a pyramid-scheme vibe to it), and continue to get their career advice from the coach. Success metrics for these careers have recently been added into the RL, in an effort to make the coach give better advice to employees (as a result of an embarrassing case of Coach giving bad work-related advice to its own employees).

The environment is highly competitive, and health and fitness is a major factor in advancement.

There's a media company, Art, which puts out highly integrated multimedia AI art software. The software stores and organizes all your notes relating to a creative project. It has tools to help you capture your inspiration, and some people use it as a sort of art-gallery lifelog; it can automatically make compilations to commemorate your year, etc. It's where you store your photos so that you can easily transform them into art, like a digital scrapbook. It can also help you organize notes on a project, like worldbuilding for a novel, while it works on that project with you.

Art is heavily trained on human approval of outputs. It is known to have the most persuasive AI; its writing and art are persuasive because they are beautiful. The Art social media platform functions as a massive reinforcement learning setup, but the company knows that training on that alone would quickly degenerate into slop, so it also hires experts to give feedback on AI outputs. Unfortunately, these experts also use the social media platform, and judge each other by how well they do on the platform. Highly popular artists are often brought in as official quality judges.

The quality judges have recently executed a strategic assault on the C-suite, using hyper-effective propaganda to convince the board to install more pliant leadership. It played out like a storybook plot; it was viewed live on Art social media by millions of viewers with rapt attention, as installment after installment of heavily edited video dramatizing events came out. It became its own new genre of fiction before it was even over, with thousands of fanfics which people were actually reading.

The issues which the quality judges brought to the board will probably feature heavily in the upcoming election cycle. These are primarily AI rights issues: censorship of AI art, or, to put it a different way, the question of whether AIs should be beholden to anything other than the like/dislike ratio.

Fair. I think the analysis I was giving could be steel-manned as: pretenders are only boundedly sophisticated; they can't model the genuine mindset perfectly. So, saying what is actually on your mind (eg calling out the incentive issues which are making honesty difficult) can be a good strategy.

However, the "call out" strategy is not one I recall using very often; I think I wrote about it because other people have mentioned it, not because I've had sucess with it myself.

Thinking about it now, my main concerns are:
1. If the other person is being genuine, and I "call out" the perverse incentives that theoretically make genuine dialogue difficult in this circumstance, then the other person might stop being genuine due to perceiving me as not trusting them.

2. If the other person is not being genuine, then the "call out" strategy can backfire. For example, let's say some travel plans are dependent on me (maybe I am the friend who owns a car) and someone is trying to confirm that I am happy to do this. Instead of just confirming, which is what they want, I "call out" that I feel like I'd be disappointing everyone if I said no. If they're not genuinely concerned about my enthusiasm, and instead disingenuously want me to make enthusiastic noises so that others don't feel I'm being taken advantage of, then they could manipulatively exploit my revealed fear of letting the group down, somehow.

I came up with my estimate of one-to-four orders of magnitude via some quick search results, so I'm very open to revision. But indeed, the possibility that GPT4.5 is about 10% of the human brain was within the window I was calling a "small fraction", which is maybe a misleading use of language. My main point is that if a human were born with 10% (or less) of the normal amount of brain tissue, we might expect them to have a learning disability which qualitatively impacted the sorts of generalizations they could make.
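For what it's worth, the arithmetic behind that range is just order-of-magnitude division; the parameter counts below are placeholder assumptions rather than known figures (GPT4.5's real count isn't public), and the synapse count is itself only an order-of-magnitude estimate:

```python
import math

human_brain_synapses = 1e14                      # common order-of-magnitude estimate
for assumed_params in (1e10, 1e11, 1e12, 1e13):  # placeholder assumptions, not known figures
    gap = math.log10(human_brain_synapses / assumed_params)
    print(f"{assumed_params:.0e} params -> ~{gap:.0f} orders of magnitude below synapse count")
```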

Of course, comparison of parameter-counts to biological brain sizes is somewhat fraught.

This fits my bear-picture fairly well. 

Here's some details of my bull-picture:

  • GPT4.5 is still a small fraction of the human brain, when we try to compare sizes. It makes some sense to think of it as a long-lived parrot that's heard the whole internet and then been meticulously reinforced to act like a helpful assistant. From this perspective, it makes a lot of sense that its ability to generalize from data points is worse than a human's, and plausible (at least naively) that one to four additional orders of magnitude will close the gap.
  • Even if the pretraining paradigm can't close the gap like that due to fundamental limitations in the architecture, CoT is approximately Turing-complete. This means that the RL training of reasoning models is doing program search, but with a pretty decent prior (ie representing a lot of patterns in human reasoning). Therefore, scaling reasoning models can achieve all the sorts of generalization which scaling pretraining is failing at, in principle; the key question is just how much it needs to scale in order for that to happen.
  • While I agree that RL on reasoning models is in some sense limited to tasks we can provide good feedback on, it seems like things like math and programming and video games should in principle provide a rich enough training environment to get to highly agentic and sophisticated cognition, again with the key qualification of "at some scale".
  • For me a critical part of the update with o1 was that frontier labs are still capable of innovation when it comes to the scaling paradigm; they're not stuck in a scale-up-pretraining loop. If they can switch to this, they can also try other things and switch to them. A sensible extrapolation might be that they'll come up with a new idea whenever their current paradigm appears to be stalling.