All of kave's Comments + Replies

kave20

I'm inclined to agree, but at least this is an improvement over it only living in Habryka's head. It may be that this + moderation is basically sufficient, as people seem to have mostly caught on to the intended patterns.

kave*160

I spent some time Thursday morning arguing with Habryka about the intended use of react downvotes. I think I now have a fairly compact summary of his position.

PSA: When to upvote and downvote a react

Upvote a react when you think it's helpful to the conversation (or at least, not antihelpful) and you agree with it. Imagine a react were a comment. If you would agree-upvote it and not karma-downvote it, you can upvote the react.

Downvote a react when you think it's unhelpful for the conversation. This might be because you think the react isn't being used for i... (read more)

It's not really feasible for the feature to rely on people reading this PSA in order to work well. The correct usage needs to be obvious.

Follow-up: if you would disagree-vote with a react but not karma-downvote it, you can use the opposite react.

kave*Ω462

You claim (and I agree) that option control will probably not be viable at extreme intelligence levels. But I also notice that when you list ways that AI systems help with alignment, all but one (maybe two), as I count it, are option control interventions.

evaluating AI outputs during training, labeling neurons in the context of mechanistic interpretability, monitoring AI chains of thought for reward-hacking behaviors, identifying which transcripts in an experiment contain alignment-faking behaviors, classifying problematic inputs and outputs for the purpos

... (read more)
kave*Ω340

I do not think your post is arguing for creating warning shots. I understand it to be advocating for not averting warning shots.

To extend your analogy, there are several houses that are built close to a river, and you think that a flood is coming that will destroy them. You are worried that if you build a dam that would protect the houses currently there, then more people will build by the river and their houses will be flooded by even bigger floods in the future. Because you are worried people will behave in this bad-for-them way, you choose not to help t... (read more)

kaveΩ464

I expect moderately sized warning shots to increase the chances humanity as a whole takes serious actions and, for example, steps up efforts to align the frontier labs.

It seems naïvely evil to knowingly let the world walk into a medium-sized catastrophe. To be clear, I think that sometimes it is probably evil to stop the world from walking into a catastrophe, if you think that increases the risk of bad things like extinctions. But I think the prior of not diagonalising against others (and of not giving yourself rope with which to trick yourself) is strong.

3Jan_Kulveit
The quote is somewhat out of context. Imagine a river with some distribution of flood sizes. Imagine this proposed improvement: a dam which is able to contain 1-year, 5-year and 10-year floods. It is too small for 50-year floods or larger, and may even burst and make the flood worse. I think such a device is not an improvement, and may make things much worse - because of the perceived safety, people may build houses close to the river, and when the large flood hits, the damages could be larger. I have a hard time parsing what you want to say relative to my post. I'm not advocating for people to deliberately create warning shots.
kave94

there's evidence about bacteria manipulating weather for this purpose

Sorry, what?

gwern160

Ice-nucleating bacteria: https://www.nature.com/articles/ismej2017124 https://www.sciencefocus.com/planet-earth/bacteria-controls-the-weather

If you can secrete the right things, you can potentially cause rain/snow inside clouds. You can see why that might be useful to bacteria swept up into the air: the air may be a fine place to go temporarily, and to go somewhere, but like a balloon or airplane, you do want to come down safely at some point, usually somewhere else, and preferably before the passengers have begun to resort to cannibalism. So given that ev... (read more)

kaveΩ342

I think you train Claude 3.7 to imitate the paraphrased scratchpad, but I'm a little unsure because you say "distill". Just checking: does Claude 3.7 still produce CoT (in the style of the paraphrase) after training, rather than being trained to perform the paraphrased-CoT reasoning in one step?

4Fabien Roger
By distillation, I mean training to imitate. So in the distill-from-paraphrased setting, the only model involved at evaluation time is the base model fine-tuned on paraphrased scratchpads, and it generates an answer from beginning to end.
kave*93

It's been a long time since I looked at virtual comments, as we never actually merged them in. IIRC, none were great, but sometimes they were interesting (in a "bring your own thinking" kind of way).

They were implemented as a Turing test, where mods would have to guess which was the real comment from a high karma user. If they'd been merged in, it would have been interesting to see the stats on guessability.

kave3-1

Could exciting biotech progress lessen the societal pressure to make AGI?

Suppose we reach a temporary AI development pause. We don't know how long the pause will last; we don't have a certain end date nor is it guaranteed to continue. Is it politically easier for that pause to continue if other domains are having transformative impacts?

I've mostly thought this is wishful thinking. Most people don't care about transformative tech; the absence of an alternative path to a good singularity isn't the main driver of societal AI progress.

But I've updated some her... (read more)

kave71

I think your comment is supposed to be an outside view argument that tempers the gears-level argument in the post. Maybe we could think of it as providing a base-rate prior for the gears-level argument in the post. Is that roughly right? I'm not sure how much I buy into this kind of argument, but I also have some complaints by the outside view's lights.

First, let me quickly recap your argument as I understand it.

R&D increases welfare by allowing an increase in consumption. We'll assume that our growth in consumption is driven, in some fraction, by R&

... (read more)
1Vasco Grilo
Thanks for the great summary, Kave! Nitpick. SWP received 1.82 M 2023-$ (= 1.47*10^6*1.24) during the year ended on 31 March 2024, which is 1.72*10^-8 (= 1.82*10^6/(106*10^12)) of the gross world product (GWP) in 2023, and OP estimated R&D has a benefit-to-cost ratio of 45. So I estimate SWP can only be up to 1.29 M (= 1/(1.72*10^-8)/45) times as cost-effective as R&D due to this increasing SWP’s funding. Fair points, although I do not see how they would be sufficiently strong to overcome the large baseline difference between SWP and general R&D. I do not think reducing the nearterm risk of human extinction is astronomically cost-effective, and I am sceptical of longterm effects.
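
(A quick sanity check of the arithmetic above, as a Python sketch; all figures are simply the ones quoted in the comment, not independently sourced.)

```python
# Reproducing the cost-effectiveness bound quoted above.
swp_funding_nominal = 1.47e6   # SWP's funding for the year ended 31 March 2024
to_2023_usd = 1.24             # conversion factor used in the comment
gwp_2023_usd = 106e12          # gross world product in 2023, in $
rd_benefit_cost_ratio = 45     # OP's estimated benefit-to-cost ratio for R&D

swp_funding_usd = swp_funding_nominal * to_2023_usd         # ≈ 1.82e6 (1.82 M 2023-$)
share_of_gwp = swp_funding_usd / gwp_2023_usd                # ≈ 1.72e-8
max_ratio_vs_rd = 1 / share_of_gwp / rd_benefit_cost_ratio   # ≈ 1.29e6

print(f"{swp_funding_usd:.3g}, {share_of_gwp:.3g}, {max_ratio_vs_rd:.3g}")
```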
kave20

From population mean or from parent mean?

4GeneSmith
Population mean
kave70

Curated. Genetically enhanced humans are my best guess for how we achieve existential safety. (Depending on timelines, they may require a coordinated slowdown to work.) This post is a pretty readable introduction to a bunch of the why and how and what still needs to be done.

I think this post is maybe slightly too focused on "how to genetically edit for superbabies" to fully deserve its title. I hope we get a treatment of more selection-based methods sometime soon.

GeneSmith mentioned the high-quality discussion as a reason to post here, and I'm glad we're a... (read more)

5GeneSmith
Yes, the two other approaches not really talked about in this thread that could also lead to superbabies are iterated meiotic selection and genome synthesis. Both have advantages over editing (you don't need to have such precise knowledge of causal alleles with iterated meiotic selection or with genome synthesis), but my impression is they're both further off than an editing approach. I'd like to write more about both in the future.
kave*30

My understanding when I last looked into it was that the efficient updating of the NNUE basically doesn't matter, and what really matters for its performance and CPU-runnability is its small size.

kave22

I'm not aware of a currently published protocol; sorry for confusing phrasing!

kave70

There are various technologies that might let you make many more egg cells than are possible to retrieve from an IVF cycle. For example, you might be able to mature oocytes from an ovarian biopsy, or you might be able to turn skin cells into eggs.

2JenniferRM
Wait, what? I know Aldous Huxley is famous for writing a scifi novel in 1931 titled "Don't Build A Method For Simulating Ovary Tissue Outside The Body To Harvest Eggs And Grow Clone Workers On Demand In Jars" but I thought that his warning had been taken very very seriously. Are you telling me that science has stopped refusing to do this, and there is now a protocol published somewhere outlining "A Method For Simulating Ovary Tissue Outside The Body To Harvest Eggs"???
kave203

Copying over Eliezer's top 3 most important projects from a tweet:

1.  Avert all creation of superintelligence in the near and medium term.

2.  Augment adult human intelligence.

3.  Build superbabies.

kave40

Looks like the base url is supposed to be niplav.site. I'll change that now (FYI @niplav)

kaveΩ682

I think TLW's criticism is important, and I don't think your responses are sufficient. I also think the original example is confusing; I've met several people who, after reading OP, seemed to me confused about how engineers could use the concept of mutual information.

Here is my attempt to expand your argument.

We're trying to design some secure electronic equipment. We want the internal state and some of the outputs to be secret. Maybe we want all of the outputs to be secret, but we've given up on that (for example, radio shielding might not be practical or... (read more)
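
(To make the engineering use concrete, here is a minimal toy sketch, entirely of my own construction: estimate the mutual information between a secret internal bit and what an adversary can observe, for a "leaky" design and a "shielded" one. If you can prove the shielded quantity is zero, no adversary, however clever, learns the secret from those observations.)

```python
import math
import random
from collections import Counter

def mutual_information(pairs):
    """Mutual information (bits) of an empirical joint distribution over (secret, observation) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(s for s, _ in pairs)
    py = Counter(o for _, o in pairs)
    mi = 0.0
    for (s, o), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((px[s] / n) * (py[o] / n)))
    return mi

random.seed(0)
secrets = [random.randint(0, 1) for _ in range(100_000)]

# Leaky design: the emission is just the secret bit.
leaky = [(s, s) for s in secrets]

# Shielded design: the emission is the secret XORed with an independent uniform mask,
# so the emission carries no information about the secret.
shielded = [(s, s ^ random.randint(0, 1)) for s in secrets]

print(f"leaky:    I(secret; emission) ≈ {mutual_information(leaky):.3f} bits")     # ≈ 1.000
print(f"shielded: I(secret; emission) ≈ {mutual_information(shielded):.3f} bits")  # ≈ 0.000
```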

7johnswentworth
I think that's basically right, and good job explaining it clearly and compactly. I would also highlight that it's not just about adversaries. One of the main powers of proof-given-assumptions is that it allows us to rule out large classes of unknown unknowns in one go. And, insofar as the things-proven-given-assumptions turn out to be false, it allows us to detect previously-unknown unknowns.
kave20

With LLMs, we might be able to aggregate more qualitative anonymous feedback.

kave20

The general rule is roughly "if you write a frontpage post which has an announcement at the end, that can be frontpaged". So for example, if you wrote a post about the vision for Online Learning that included the course announcement as a relatively small part, that would probably work.

By the way, posts are all personal until mods process them, usually around twice a day. So that's another reason you might sometimes see posts landing on personal for a while.

2Alex Flint
Got it! Good to know.
kave20

Mod note: this post is personal rather than frontpage because event/course/workshop/org... announcements are generally personal, even if the content of the course, say, is pretty clearly relevant to the frontpage (as in this case)

2Alex Flint
Thanks! We were wondering about that. Is there any way we could be changed to the frontpage category?
kave40

I believe it includes some older donations:

  1. Our Manifund application's donations, including donations going back to mid-May, totalling about $50k
  2. A couple of older individual donations, in October/early Nov, totalling almost $200k
kave*40

Mod note: I've put this on Personal rather than Frontpage. I imagine the content of these talks will be frontpage content, but event announcements in general are not.

kave70

neural networks routinely generalize to goals that are totally different from what the trainers wanted

I think this is slightly a non sequitur. I take Tom to be saying "AIs will care about stuff that is natural to express in human concept-language" and your evidence to be primarily about "AIs will care about what we tell it to", though I could imagine there being some overflow evidence into Tom's proposition.

I do think the limited success of interpretability is an example of evidence against Tom's proposition. For example, I think there's lots of work where... (read more)

8eggsyntax
Thanks, that's a totally reasonable critique. I kind of shifted from one to the other over the course of that paragraph.  Something I believe, but failed to say, is that we should not expect those misgeneralized goals to be particularly human-legible. In the simple environments given in the goal misgeneralization spreadsheet, researchers can usually figure out eventually what the internalized goal was and express it in human terms (eg 'identify rulers' rather than 'identify tumors'), but I would expect that to be less and less true as systems get more complex. That said, I'm not aware of any strong evidence for that claim, it's just my intuition. I'll edit slightly to try to make that point more clear.
kave551

I dug up my old notes on this book review. Here they are:

So, I've just spent some time going through the World Bank documents on its interventions in Lesotho. The Anti-Politics Machine is not doing great on epistemic checking

  • There is no recorded Thaba-Tseka Development Project, despite the period in which it should have taken place being covered
  • There is a Thaba-Bosiu development project (parts 1 and 2) taking place at the correct time.
    • Thaba-Bosiu and Thaba-Tseka are both regions of Lesotho
    • The spec doc for Thaba-Bosiu Part 2 references the alleged problems
... (read more)
8Benquo
Wow, thanks for doing the legwork on this - seems like quite possibly I'm analyzing fiction? Annoying if true. Google's AI response to my search for the Thaba-Tseka Development Project says: There's a good chance this is an AI hallucination, though; a cursory search of the main documents didn't yield any references to a "Thaba-Tseka development project," or the wood or ponies. I'm not familiar with World Bank documentation, though, and likely the right followup would involve looking at exactly what's cited in the book. However, the other lead funder, the Canadian International Development Agency, does seem to have at least one publicly referenced document about a "Thaba-Tseka rural development program": Evaluation, the Kingdom of Lesotho rural development : evaluation design for phase 1, the Thaba Tseka project

I think 2023 was perhaps the peak for discussing the idea that neural networks have surprisingly simple representations of human concepts. This was the year of Steering GPT-2-XL by adding an activation vector, cheese vectors, and the slightly weird lie detection paper, and was just after Contrast-consistent search.

This is a pretty exciting idea, because if it’s easy to find human concepts we want (or don’t want) networks to possess, then we can maybe use that to increase the chance of systems that are honest, kind, loving (and can ask them... (read more)

kave72

I'm not sure I understand what you're driving at, but as far as I do, here's a response: I have lots of concepts and abstractions over the physical world (like chair). I don't have many concepts or abstractions over strings of language, apart from as factored through the physical world. (I have some, like register or language, but they don't actually feel that "final" as concepts).

As far as factoring my predictions of language through the physical world, a lot of the simplest and most robust concepts I have are just nouns, so they're already represented by tokenisation machinery, and I can't do interesting interp to pick them out.

kave*115

That sounds less messy than the path from 3D physical world to tokens (and less (edit: I meant more here!) messy than the path from human concepts to tokens)

5Neel Nanda
Sure, but I think that human cognition tends to operate at a level of abstraction above the configuration of atoms in a 3D environment. "That is a chair" is a useful way to reason about an environment, while "that is a configuration of pixels that corresponds to a chair when projected at a certain angle in certain lighting conditions" must first be converted to "that is a chair" before anything useful can be done. Text just has a lot of useful preprocessing applied already and is far more compressed.
kave40

quality of tasks completed

quantity?

kave72

Just a message to confirm: Zac's leg of the trade has been executed for $810. Thanks Lucie for those $810!

8Zac Hatfield-Dodds
And I've received an email from Mieux Donner confirming Lucie's leg has been executed for 1,000€. Thanks to everyone involved! If anyone else is interested in a similar donation swap, from either side, I'd be excited to introduce people or maybe even do this trick again :D
kave2-1

This doesn't play very well with fractional Kelly, though.

kave22

I do feel like it would be good to start with a more optimistic prior on new posts. Over the last year, the mean post karma was a little over 13, and the median was 5.

kave24

This seems unlikely to satisfy linearity, as A/B + C/D is not equal to (A+C)/(B+D)
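
(A concrete instance, with numbers chosen just for illustration:)

$$\frac{1}{1} + \frac{1}{1} = 2 \;\neq\; 1 = \frac{1+1}{1+1}$$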

kave60

I don't feel particularly uncertain. This EA Forum comment and its parents inform my view quite a bit.

kave20

Maybe sometimes a team will die in the dungeon?

kave20

<details>blah blah</details>

kave*30

So I did some super dumb modelling.

I was like: let's assume that there aren't interaction effects between the encounters either in the difficulty along a path or in the tendency to co-occur. And let's assume position doesn't matter. Let's also assume that the adventurers choose the minimally difficult path, only moving across room edges.

To estimate the value of an encounter, let's look at how the dungeons where it occurs in one of the two unavoidable locations (1 and 9) differ on average from the overall average.

Assuming ChatGPT did all the implementation

... (read more)
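(For concreteness, a rough Python sketch of the estimation step described above; the file name, column names and scoring field are hypothetical, since the actual implementation was left to ChatGPT, and the minimal-path assumption isn't used in this step.)

```python
import pandas as pd

# Hypothetical layout: one row per dungeon, columns "room_1" .. "room_9" holding the
# encounter placed in each room, and "score" holding how well the dungeon performed.
df = pd.read_csv("dungeons.csv")

overall_mean = df["score"].mean()

# Rooms 1 and 9 are on every path, so an encounter placed there is always faced.
# Estimate each encounter's value as the average difference from the overall mean
# among dungeons where it sits in one of those unavoidable rooms.
encounter_values = {}
for encounter in pd.unique(df[["room_1", "room_9"]].values.ravel()):
    unavoidable = df[(df["room_1"] == encounter) | (df["room_9"] == encounter)]
    encounter_values[encounter] = unavoidable["score"].mean() - overall_mean

for encounter, value in sorted(encounter_values.items(), key=lambda kv: kv[1]):
    print(f"{encounter}: {value:+.2f}")
```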
kave40

I'm guessing encounter 4 (rather than encounter 6) follows encounter 3?

4aphyer
The dungeon is laid out as depicted; Room 3 does not border Room 4, and does border Room 6.  You don't, however, know what exactly the adventurers are going to do in your dungeon, or which encounters they are going to do in which order.  Perhaps you could figure that out from the dataset. (I've edited the doc to make this clearer).
kave40

You can simulate a future by short-selling the underlying security and buying a bond with the revenue. You can simulate short-selling the same future by borrowing money (selling a bond) and using the money to buy the underlying security.

I think these are backwards. At the end of your simulated future, you end up with one less of the stock, but you have k extra cash. At the end of your simulated short sell, you end up with one extra of the stock and k less cash.
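
(A quick payoff check at expiry, ignoring interest, with $S_T$ the security's final price and $k$ the cash amount involved:)

$$\text{short-sell the security, buy a bond: } \; k - S_T \quad (\text{a short future's payoff})$$

$$\text{borrow } k \text{, buy the security: } \; S_T - k \quad (\text{a long future's payoff})$$

So the first construction reproduces a short future and the second a long one, which is why the two descriptions read as swapped.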

4lsusr
You're right. Thanks. Fixed.

A neat stylised fact, if it's true. It would be cool to see people checking it in more domains.

I appreciate that Ege included all of: examples, theory, and predictions of the theory. I think there's lots of room for criticism of this model, which it would be cool to see tried. In particular, as far as I understand the formalism, it doesn't seem like it is obviously discussing the costs of the investments, as opposed to their returns.

But I still like this as a rule of thumb (open to revision).

I still think this post is cool. Ultimately, I don't think the evidence presented here bears that strongly on the underlying question: "can humans get AIs to do their alignment homework?". But I think it bears on it at all, and was conducted quickly and competently.

I would like to live in a world where lots of people gather lots of weak pieces of evidence on important questions.

1Zane
I still think it was an interesting concept, but I'm not sure how deserving of praise this is since I never actually got beyond organizing two games.
kave20

Yep, if the first vote takes the score to ≤ 0, then the post will be dropped off the latest list. This is somewhat ameliorated by:

(a) a fair number of people browsing https://lesswrong.com/allPosts

(b) https://greaterwrong.com having chronological sort by default

(c) posts appearing in recent discussion in order that they're posted (though I do wonder if we filter out negative karma posts from recent discussion)

I often play around with different karma / sorting mechanisms, and I do think it would be nice to have a more Bayesian approach that started with a s... (read more)
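
(One illustrative version of such a score, as a sketch of my own rather than whatever the truncated sentence was going to propose: shrink each post's observed karma toward a prior expectation, with the prior's weight expressed as a number of pseudo-votes.)

```python
def sort_score(karma: float, num_votes: int,
               prior_karma: float = 13.0, prior_strength: float = 4.0) -> float:
    """Illustrative Bayesian-flavoured sort score: a blend of the post's observed
    karma and a prior expectation, weighted by how many votes have come in.

    A brand-new post scores close to the prior (here the ~13 mean post karma
    mentioned above), so a single early downvote can't immediately sink it."""
    w = num_votes / (num_votes + prior_strength)
    return w * karma + (1 - w) * prior_karma

print(sort_score(karma=-1, num_votes=1))   # ≈ 10.2: one early downvote no longer hides the post
print(sort_score(karma=50, num_votes=30))  # ≈ 45.6: with many votes, observed karma dominates
```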

1Sherrinford
Maybe the numerator of the score should remain at the initial karma until at least 4 people have voted, for example.
kave20

I had a quick look in the database, and you do have some tag filters set, which could cause the behaviour you describe

1Sherrinford
Thanks. I did not see any, but I will check again. Maybe I also accidentally set them when I tried to check whether I had set any...
kave33
  • Because it's a number and a vector, you're unlikely to see anyone (other than programmers) trying to use i as a variable.

I think it's quite common to use i as an index variable (for example, in a sum)

(edit: whoops, I see several people have mentioned this) 

2abstractapplic
You're right. I'll delete that aside.
kave162

In this case sitting down with someone doing similar tasks but getting more use out of LMs would likely help.

I would contribute to a bounty for y'all to do this. I would like to know whether the slow progress is prompting-induced or not.

kave20

Click on the gear icon next to the feed selector 

1Sherrinford
No, all tags are on default weight.