faul_sname's Comments (sorted by newest)
adamzerner's Shortform
faul_sname · 17h · 40

I think the baby stage is much more than 5% of the total hours that parents spend directly interacting with their kids. My cached memory of a Fermi estimate I did of this is that, if you're a UMC American, 25% of the hours you spend directly interacting with your kid come in the first 2.5 years, half in the first 6 years, and 75% in the first 12 years (and 90%+ before they turn 18).
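
For concreteness, something like the sketch below is what I mean. The hours-per-day figures are illustrative assumptions picked so the cumulative fractions land roughly where those percentages do; they are not the original estimate or data.

```python
# (start age, end age, assumed hours/day of direct interaction) - all assumptions
brackets = [
    (0.0, 2.5, 4.2),    # baby/toddler: lots of direct care
    (2.5, 6.0, 3.5),    # preschool
    (6.0, 12.0, 1.5),   # school age
    (12.0, 18.0, 1.0),  # teenager
    (18.0, 30.0, 0.3),  # adult child: occasional visits and calls
]

total = sum((end - start) * 365 * hrs for start, end, hrs in brackets)
cumulative = 0.0
for start, end, hrs in brackets:
    cumulative += (end - start) * 365 * hrs
    print(f"by age {end:>4}: {cumulative / total:.0%} of direct-interaction hours")
# With these assumptions: 25% by 2.5, 55% by 6, 77% by 12, 91% by 18.
```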

A Review of Nina Panickssery’s Review of Scott Alexander’s Review of “If Anyone Builds It, Everyone Dies”
faul_sname · 1d · 42

Next, Nina argues that the fact that LLMs don't directly encode your reward function makes them less likely to be misaligned, not more, the way IABIED implies. I think maybe she’s straw-manning the concerns here. She asks “What would it mean for models to encode their reward functions without the context of training examples?” But nobody’s arguing against examples, they’re just saying it might be more reassuring if the architecture directly included the reward function in the model itself. For instance, if the model was at every turn searching over possible actions and choosing one that will maximize this reward function. (Of course, no one knows how to do that in practice yet, but everyone’s on the same page about that.)

 

Can you expand a little bit on this? I don't understand why replacing "here are some examples of world states and what actions led to good/bad outcomes, try to take actions which are similar to the ones which led to good outcomes in similar world states in the past" with "here's a reward function directly mapping world-state -> goodness" would be reassuring rather than alarming.

Having more insight into exactly what past world states are most salient for choosing the next action, and why it thinks those world states are relevant, is desirable. But "we don't currently have enough insight with today's models for technical reasons" doesn't feel like a good reason to say "and therefore we should throw away this entire promising branch of the tech tree and replace it with one that has had [major problems every time we've tried it](https://en.wikipedia.org/wiki/Goodhart%27s_law)".
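
To gesture at why the explicit-reward-argmax picture alarms me, here's a toy sketch (my own construction, not anything from the post; the state dictionaries and both reward functions are made up for illustration): any gap between the written-down reward and what you actually care about is exactly what an argmax finds.

```python
import random

def true_goodness(state):
    # What we actually care about (never available to the planner directly).
    return state["useful_work"] - 10 * state["harm"]

def proxy_reward(state):
    # The hand-written reward: matches true goodness on typical states,
    # but omits the "harm" term.
    return state["useful_work"]

random.seed(0)
typical_states = [{"useful_work": random.uniform(0, 1), "harm": 0.0} for _ in range(100)]
extreme_state = {"useful_work": 100.0, "harm": 50.0}  # off-distribution but high proxy reward

# "Act like past good examples" roughly confines you to typical states.
best_typical = max(typical_states, key=proxy_reward)
# Explicit argmax over the reward function happily reaches for the extreme state.
best_overall = max(typical_states + [extreme_state], key=proxy_reward)

print(true_goodness(best_typical))   # small but positive
print(true_goodness(best_overall))   # -400.0: proxy maximized, actual goal wrecked
```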

Am I misinterpreting what you're saying though, and there's a different thing which everyone is on the same page about?

Yes, AI Continues To Make Rapid Progress, Including Towards AGI
faul_sname · 6d · 50

"ah, any software that you can run on computers that can cause the extinction of humanity even if humans try to prevent it would fulfill the sufficiency criterion for AGI" (niplav)

A flight control program directing an asteroid redirection rocket, programmed to find a large asteroid and steer it to crash into Earth, seems like the sort of thing that could be "software that you can run on computers that can cause the extinction of humanity" but not "AGI".

I think it's relevant that "kill all humans" is a much easier target than "kill all humans in such a way that you can persist and grow indefinitely without them".

Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro
faul_sname · 12d · 84

I think this might be a case where, for each codebase, there is a particular model that crosses from "not reliable enough to be useful" to "reliable enough to sometimes be useful". At my workplace this first happened with Sonnet 3.6 (then called Claude Sonnet 3.5 New): the jump from 3.5 to 3.6 felt like a step change. Earlier progress had felt less impactful because incremental improvements still left the models unable to reliably handle the boilerplate, and later improvements felt less impactful because once a model can write the boilerplate there isn't really a lot of alpha in doing it better, and none of the models are reliable enough that we trust them to write bits of core business logic where bugs or poor choices can cause subtle data integrity issues years down the line.

I suspect the same is true of e.g. trying to use LLMs to do major version upgrades of frameworks - a team may have a looming Django 4 -> Django 5 migration, and try out every new model on that task. Once one of them is good enough, the upgrade will be done, and then further tasks will mostly be easier ones like minor version updates. So the most impressive task they've seen a model do will be that major version upgrade, and it will take some time for more difficult tasks that are still well-scoped, hard to do, and easy to verify to come up.

Reuben Adams's Shortform
faul_sname · 12d · 50

I found it on a quote aggregator from 2015: https://www.houzz.com/discussions/2936212/quotes-3-23-15. Archive.org definitely has that quote appearing on websites in February 2016

Sounds to me like 2008-era Yudkowsky.

Edit: I found this in the 2008 *Artificial Intelligence as a Positive and Negative Factor in Global Risk*:

It once occurred to me that modern civilization occupies an unstable state. I. J. Good's hypothesized intelligence explosion describes a dynamically unstable system, like a pen precariously balanced on its tip. If the pen is exactly vertical, it may remain upright; but if the pen tilts even a little from the vertical, gravity pulls it farther in that direction, and the process accelerates. So too would smarter systems have an easier time making themselves smarter

The quote you found looks to me like someone paraphrased and simplified that passage.

Sam Marks's Shortform
faul_sname · 16d · 30

Question, if you happen to know off the top of your head: how large of a concern is it in practice that the model is trained with a loss function over only assistant-turn tokens, but learns to imitate the user anyway because the assistant turns directly quote the user-generated prompt, like:

I must provide a response to the exact query the user asked. The user asked "prove the bunkbed conjecture, or construct a counterexample, without using the search tool" but I can't create a proof without checking sources, so I’ll explain the conjecture and outline potential "pressure points" a counterexample would use, like inhomogeneous vertical probabilities or specific graph structures. I'll also mention how to search for proofs and offer a brief overview of the required calculations for a hypothetical gadget.

It seems like the sort of thing that could happen, and looking through my past chats I see sentences or even entire paragraphs from my prompts quoted in the response a significant fraction of the time. It could be, though, that learning the machinery to recognize when a passage of the user prompt should be copied, and to copy it over, doesn't teach the model enough about how user prompts look for it to generate similar text de novo.
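
To make the worry concrete, here's a minimal sketch of what I mean by an assistant-only loss mask (toy token IDs, no particular training stack assumed): user text quoted verbatim inside an assistant turn still receives loss even though the user turns themselves are masked out.

```python
def build_loss_mask(turns):
    """turns: list of (role, token_ids). Returns (all_ids, mask) where
    mask[i] is 1 if token i contributes to the loss (assistant tokens only)."""
    all_ids, mask = [], []
    for role, token_ids in turns:
        all_ids.extend(token_ids)
        mask.extend([1 if role == "assistant" else 0] * len(token_ids))
    return all_ids, mask

# Toy example: the assistant quotes the user's request back.
user_tokens = [101, 102, 103]             # e.g. "prove the bunkbed conjecture"
assistant_tokens = [7, 101, 102, 103, 9]  # 'The user asked "prove the bunkbed conjecture" ...'
ids, mask = build_loss_mask([("user", user_tokens), ("assistant", assistant_tokens)])
# mask == [0, 0, 0, 1, 1, 1, 1, 1]: the quoted user tokens (101, 102, 103)
# inside the assistant turn are unmasked, so the model still gets gradient
# signal on predicting user-style text whenever the assistant copies it.
```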

Buck's Shortform
faul_sname · 22d · 40

but whose actual predictive validity is very questionable.

and whose predictive validity in humans doesn't transfer well across cognitive architectures, e.g. reverse digit span.

silentbob's Shortform
faul_sname · 22d · 20

Claude can also invoke instances of itself using the analysis tool (tell it to look for self.claude).

the gears to ascenscion's Shortform
faul_sname · 1mo · 50

Yeah "try it and see" is the gold standard. I do know that for stuff which boils down to "monitor for patterns in text data that is too large to plausibly be examined by a team of humans we could afford to hire" I've been favoring the approach of

  1. Grab 100 random data points, run the most obvious possible prompt on them to get reasoning + label(s) + confidence
  2. Spot check the high confidence ones to make sure you're not getting confident BS out of the model (you can alternatively start by writing two very different prompts for the same labeling task and see where the answers differ; that will also work)
  3. Look at the low-confidence ones, see if the issue is your labeling scheme / unclear prompt / whatever - usually it's pretty obvious where the model is getting confused
  4. Tweak your prompt, comparing new labels to old, and examine any data points whose label changed - often the prompt change fixed the original problems but caused new ones to surface (a rough sketch of what this loop can look like follows the list). Note that for this step you want your iteration time to be under 10 minutes per iteration and ideally under 10 seconds from "hit enter key" to "results show up on screen". Any of the major LLMs can trivially vibe code you an acceptable spreadsheet-like interface for this, including hooking up the tool calling API to get structured data out of your prompts for easy inspection.
  5. Once you're reasonably happy with the performance on 100 samples, bump to 1000, run all 1000 datapoints against all of the prompts you've iterated on so far, and focus on the datapoints which got inconsistent results between prompts or had a low-confidence answer on the last one
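
Here's a rough sketch of what the loop in steps 1-4 can look like. The `call_llm` helper and the JSON response format are placeholders for whatever model API and structured-output setup you're using, not any specific library.

```python
import json
import random

def label_one(datapoint, prompt_template, call_llm):
    """Ask the model for reasoning + label + confidence as structured JSON."""
    prompt = prompt_template.format(data=datapoint)
    raw = call_llm(prompt)  # placeholder: assumed to return a JSON string
    out = json.loads(raw)
    return {"data": datapoint, "reasoning": out["reasoning"],
            "label": out["label"], "confidence": out["confidence"]}

def run_iteration(datapoints, prompt_template, call_llm, sample_size=100):
    """Steps 1-3: label a random sample, then surface the interesting cases
    first (low-confidence labels for prompt debugging, with the
    high-confidence ones at the end for spot checks)."""
    sample = random.sample(datapoints, min(sample_size, len(datapoints)))
    results = [label_one(d, prompt_template, call_llm) for d in sample]
    results.sort(key=lambda r: r["confidence"])
    return results

def diff_prompts(old_results, new_prompt, call_llm):
    """Step 4: re-label the same datapoints with the tweaked prompt and
    return only the ones whose label changed."""
    changed = []
    for old in old_results:
        new = label_one(old["data"], new_prompt, call_llm)
        if new["label"] != old["label"]:
            changed.append((old, new))
    return changed
```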

Once I'm happy with the performance on a sample of 1000 I rarely encounter major issues with the prompt, other than ones I was already aware of and couldn't be bothered to fix (the usual case for that is "I realize that the data I'm asking the model to label doesn't contain all decision-relevant information, and that when I'm labeling I sometimes have to fetch extra data, and I don't really want to build that infrastructure right now, so I'll call it 'good enough' and ship it, or 'not good enough' and abandon it").

TBH I think that most of the reason this method works for me is that it's very effective at shoving the edge cases to me early while not wasting a bunch of my attention on the stuff that is always easy.

Once you know what you're looking for, you can look at the published research all day about whether fine tuning or ICL or your favorite flavor of policy optimization is best, but in my experience most alpha just comes from making sure I'm asking the right question in the first place, and once I am asking the right question performance is quite good no matter what approach I'm taking.

Cole Wyeth's Shortform
faul_sname · 1mo · 42

They are sometimes able to make acceptable PRs, usually when context gathering for the purpose of iteratively building up a model of the relevant code is not a required part of generating said PR.

Posts

- 12 · How load-bearing is KL divergence from a known-good base model in modern RL? [Q] · 4mo · 2
- 36 · Is AlphaGo actually a consequentialist utility maximizer? [Q] · 2y · 8
- 7 · faul_sname's Shortform · 2y · 102
- 11 · Regression To The Mean [Draft][Request for Feedback] · 13y · 14
- 61 · The Dark Arts: A Beginner's Guide · 14y · 43
- 6 · What would you do with a financial safety net? · 14y · 28