Raised: $267,499Goal 1 of 3: $1,000,000

If you know nothing about a thing, the first example or sample gives you a disproportionate amount of information, often more than any subsequent sample. It lets you locate the idea in conceptspace, get a sense of what domain/scale/magnitude you're dealing with, and provides an anchor for further thinking.

Customize
Rationality+Rationality+World Modeling+World Modeling+AIAIWorld OptimizationWorld OptimizationPracticalPracticalCommunityCommunity
Personal Blog+
Has some government or random billionaire sought out Petrov's heirs and made sure none of them have to work again if they don't want to? It seems like an obviously sensible thing to do from a game-theoretic point of view.
gwernΩ278328
7
LLM support for writing LessWrong posts: virtual comments. Back in August I discussed with Rafe & Oliver a bit about how to integrate LLMs into LW2 in ways which aren't awful and which encourage improvement---particularly using the new 'prompt caching' feature. To summarize one idea: we can use long-context LLMs with prompt caching to try to simulate various LW users of diverse perspectives to write useful feedback on drafts for authors. (Prompt caching (eg) is the Transformer version of the old RNN hidden-state caching trick, where you run an input through the (deterministic) NN, and then save the intermediate version, and apply that to arbitrarily many future inputs, to avoid recomputing the first input each time, which is the naive way to do it. You can think of it as a lightweight finetuning. This is particularly useful if you are thinking about having large generic prompts---such as an entire corpus. A context window of millions of tokens might take up to a minute & $1 to compute currently, so you definitely need to be careful and don't want to compute more than once.) One idea would be to try to use LLMs to offer feedback on drafts or articles. Given that tuned LLM feedback from Claude or ChatGPT is still not that great, tending towards sycophancy or obviousness or ChatGPTese, it is hardly worthwhile running a post through a generic "criticize this essay" prompt. (If anyone on LW2 wanted to do such a thing, they are surely capable of doing it themselves, and integrating it into LW2 isn't that useful. Removing the friction might be helpful, but it doesn't seem like it would move any needles.) So, one way to force out more interesting feedback would be to try to force LLMs out of the chatbot assistant mode-collapse, and into more interesting simulations for feedback. There has been some success with just suggestively-named personas or characters in dialogues (you could imagine here we'd have "Skeptic" or "Optimist" characters), but we can do better. Since th
As of right now, I expect we have at least a decade, perhaps two, until we get a human intelligence level generalizing AI (which is what I consider AGI). This is a controversial statement in these social circles, and I don't have the bandwidth or resources to write a concrete and detailed argument, so I'll simply state an overview here. * Scale is the key variable driving progress to AGI. Human ingenuity is irrelevant. Lots of people believe they know the one last piece of the puzzle to get AGI, but I increasingly expect the missing pieces to be too alien for most researchers to stumble upon just by thinking about things without doing compute-intensive experiments. * Scale shall increasingly require more and larger datacenters and a lot of power. Humanity's track record at accomplishing megaprojects is abyssmal. If we find ourselves needing to build city-sized datacenters (with all the required infrastructure to maintain and supply it), I expect that humanity will take twice the initially estimated time and resources to build something with 80% of the planned capacity. So the main questions for me, given my current model, are these: * How many OOMs of optimization power would you need for your search process, to stumble upon a neural network model (or more accurately, an algorithm), that is just general enough that it can start improving itself? (To be clear, I expect this level of generalization to be achieved when we create AI systems that can do ML experiments autonomously.) * How much more difficult will each OOM increase be? (For example, if we see an exponential increase in resources and time to build the infrastructure, that seems to compensate for the exponential increase in the optimization power provided.) Both questions are very hard to answer with rigor I'd consider adequate given their importance. If you did press me to answer, however: my intuition is that we'd need at least three OOMs and that the OOM-increase difficulty would be exponential,
Morphism260
1
People often say things like "do x. Your future self will thank you." But I've found that I very rarely actually thank my past self, after x has been done, and I've reaped the benefits of x. This quick take is a preregistration: For the next month I will thank my past self more, when I reap the benefits of a sacrifice of their immediate utility. e.g. When I'm stuck in bed because the activation energy to leave is too high, and then I overcome that and go for a run and then feel a lot more energized, I'll look back and say "Thanks 7 am Morphism!" (I already do this sometimes, but I will now make a TAP out of it, which will probably cause me to do it more often.) Then I will make a full post describing in detail what I did and what (if anything) changed about my ability to sacrifice short-term gains for greater long-term gains, along with plausible theories w/ probabilities on the causal connection (or lack thereof), as well as a list of potential confounders. Of course, it is possible that I completely fail to even install the TAP. I don't think that's very likely, because I'm #1-prioritizing my own emotional well-being right now (I'll shift focus back onto my world-saving pursuits once I'm more stablely not depressed). In that case I will not write a full post because the experiment would have not even been done. I will instead just make a comment on this shortform to that effect.
Two shovel-ready theory projects in interpretability. Most scientific work, especially theoretical research, isn't "shovel-ready." It's difficult to generate well-defined, self-contained projects where the path forward is clear without extensive background context. In my experience, this is extra true of theory work, where most of the labour if often about figuring out what the project should actually be, because the requirements are unclear or confused. Nevertheless, I currently have two theory projects related to computation in superposition in my backlog that I think are valuable and that maybe have reasonably clear execution paths. Someone just needs to crunch a bunch of math and write up the results.  Impact story sketch: We now have some very basic theory for how computation in superposition could work[1]. But I think there’s more to do there that could help our understanding. If superposition happens in real models, better theoretical grounding could help us understand what we’re seeing in these models, and how to un-superpose them back into sensible individual circuits and mechanisms we can analyse one at a time. With sufficient understanding, we might even gain some insight into how circuits develop during training. This post has a framework for compressing lots of small residual MLPs into one big residual MLP. Both projects are about improving this framework. 1) I think the framework can probably be pretty straightforwardly extended to transformers. This would help make the theory more directly applicable to language models. The key thing to show there is how to do superposition in attention. I suspect you can more or less use the same construction the post uses, with individual attention heads now playing the role of neurons. I put maybe two work days into trying this before giving it up in favour of other projects. I didn’t run into any notable barriers, the calculations just proved to be more extensive than I’d hoped they’d be.  2) Improve error t

Popular Comments

Recent Discussion

TLDR: LessWrong + Lighthaven need about $3M for the next 12 months. Donate here, or send me an email, DM signal message (+1 510 944 3235), or leave a comment, if you want to support what we do. We are a registered 501(c)3, have big plans for the next year, and due to a shifting funding landscape need support from a broader community more than in any previous year. [1]

I've been running LessWrong/Lightcone Infrastructure for the last 7 years. During that time we have grown into the primary infrastructure provider for the rationality and AI safety communities. "Infrastructure" is a big fuzzy word, but in our case, it concretely means: 

  • We build and run LessWrong.com and the AI Alignment Forum.[2]
  • We built and run Lighthaven (lighthaven.space), a ~30,000 sq. ft.
...
1Mateusz Bagiński
I'm considering donating. Any chance of setting up some tax deduction for Euros?

I am working on it! What country would you want it for? Not all countries have charity tax-deductability, IIRC.

3Mckiev
Regarding donor recognition, I think online recognition makes a lot of sense, e.g. colored nickname on the forum, or a badge of some sorts (like the one GWWC encourages). Thank you for clearly laying out the arguments for donating to the Lightcone. I will! 
6kave
The EIN is 92-0861538

YouTube link

The ‘model organisms of misalignment’ line of research creates AI models that exhibit various types of misalignment, and studies them to try to understand how the misalignment occurs and whether it can be somehow removed. In this episode, Evan Hubinger talks about two papers he’s worked on at Anthropic under this agenda: “Sleeper Agents” and “Sycophancy to Subterfuge”.

Topics we discuss:

...
2gwern
I don't see why there would necessarily be machine-translated inner-monologues, though. If they are doing the synthesis or stitching-together of various inner-monologues with the pivot phrases like "wait" or "no, that's wrong", they could simply do that with the 'native' Chinese versions of every problem rather than go through a risky lossy translation pass. Chinese-language-centric LLMs are not so bad at this point that you need to juice them with translations of English corpuses - are they? Or if they are using precanned English inner-monologues from somewhere, why do they need to translate at all? You would think that it would be easy for multi-lingual models like current LLMs, which so easily switch back and forth between languages, to train on o1-style English inner-monologues and then be able to 'zero-shot' generate o1-style Chinese inner-monologues on demand. Maybe the weirdness is due to that, instead: it's being very conservative and is imitating the o1-style English in Chinese as literally as possible.

I'm at ~50-50 for large amounts of machine-translated being present in the dataset. 

Having worked in Chinese academia myself, "use Google Translate on the dataset" just seems like something we're extremely likely to do. It's a hard-to-explain gut feeling. I'll try poking around in the tokenizer to see if "uncommon Chinese phrases that would only appear in machine-translated COT" are present as tokens. (I think this is unlikely to be true even if they did do it, however)

I've done a cursory internet search, and it seems that there aren't many native Chi... (read more)

TLDR: In this post, I argue that if you are a junior AI safety researcher, you should consider applying to PhD programs in ML soon, especially if you have recently participated in an AI safety upskilling or research program like MATS or ARENA and might be interested in working on AI safety long term, but don't have immediate career plans. It is relatively cheap to apply and provides good future option value. I don’t argue that you should necessarily do a PhD, but some other posts do. I also point out that starting a PhD does not lock you in if better opportunities arise. PhD application deadlines for Fall 2025 start are coming up soon; many application deadlines are December 15th, though some are as early as next week. For the uninitiated, I provide a...

joec10

I'll have to push back on this. I think if there's one specific program that you'd like to go to, especially if there's an advisor you have in mind, it's good to tailor your application to that program. However, this might not apply to the typical reader of this post.

I followed a k strategy with my PhD statements of purpose (and recommendations) rather than an r strategy. I tailored my applications to the specific schools, and it seemed to work pretty decently well. I know of more qualified people who were rejected from a much higher proportion of schools who spent much less time on each application. 

(Disclaimer: this is all anecdotal. Also, I was applying for chemistry programs, not AI)

1Joshua Tindall
I agree that it's not impossible, but it's definitely very late in the cycle to start thinking about PhD applications, and the claim that it would be more helpful to make the case for a PhD to people earlier in the cycle seems totally reasonable to me
2nc
It is worth noting that UKRI is in the process of changing their language to Doctoral Landscape Awards (replacing DTP) and Doctoral Focal Awards (CDT). The announcements for BBSRC and NERC have already been done, but I can't find what EPSRC is doing.
3Joseph Miller
I started working on PhD applications about 12 days ago. I expect to have fairly polished applications for the first deadline on December 1, despite not working on this full time. So I think it's quite possible to do applications for the December 15 deadlines. You would need to contact your referees (and potential supervisors for UK universities) in the next couple of days.

This is the second in a sequence of posts scrutinizing computational functionalism (CF). In my last post, I defined a concrete claim that computational functionalists tend to make:

Practical CF: A simulation of a human brain on a classical computer, capturing the dynamics of the brain on some coarse-grained level of abstraction, that can run on a computer small and light enough to fit on the surface of Earth, with the simulation running at the same speed as base reality[1], would cause the same conscious experience as that brain.

I contrasted this with “theoretical CF”, the claim that an arbitrarily high-fidelity simulation of a brain would be conscious. In this post, I’ll scrutinize the practical CF claim.

To evaluate practical CF, I’m going to meet functionalists where they (usually) stand...

2Steven Byrnes
From reading this comment, I understand that you mean the following: * Practical CF, more explicitly: A simulation of a human brain on a classical computer, capturing the dynamics of the brain on some coarse-grained level of abstraction, that can run on a computer small and light enough to fit on the surface of Earth, with the simulation running at the same speed as base reality, would cause the same conscious experience as that brain, in the specific sense of thinking literally the exact same sequence of thoughts in the exact same order, in perpetuity. I agree that “practical CF” as thus defined is false—indeed I think it’s so obviously false that this post is massive overkill in justifying it. But I also think that “practical CF” as thus defined is not in fact a claim that computational functionalists tend to make. Let’s put aside simulation and talk about an everyday situation. Suppose you’re the building manager of my apartment, and I’m in my apartment doing work. Unbeknownst to me, you flip a coin. If it’s heads, then you set the basement thermostat to 20°C. If it’s tails, then you set the basement thermostat to 20.1°C. As a result, the temperature in my room is slightly different in the two scenarios, and thus the temperature in my brain is slightly different, and this causes some tiny number of synaptic vesicles to release differently under heads versus tails, which gradually butterfly-effect into totally different trains of thought in the two scenarios, perhaps leading me to make a different decision on some question where I was really ambivalent and going back and forth, or maybe having some good idea in one scenario but not the other. But in both scenarios, it’s still “me”, and it’s still “my mind” and “my consciousness”. Do you see what I mean? So anyway, when you wrote “A simulation of a human brain on a classical computer…would cause the same conscious experience as that brain”, I initially interpreted that sentence as meaning something more li

Practical CF, more explicitly: A simulation of a human brain on a classical computer, capturing the dynamics of the brain on some coarse-grained level of abstraction, that can run on a computer small and light enough to fit on the surface of Earth, with the simulation running at the same speed as base reality, would cause the same conscious experience as that brain, in the specific sense of thinking literally the exact same sequence of thoughts in the exact same order, in perpetuity.

I... don't think this is necessarily what @EuanMcLean meant? At the risk o... (read more)

5gwern
I think one could argue with a lot of your description of how Charles Darwin developed his theory of evolution after the H.M.S. Beagle expedition and decades of compiling examples and gradually elaborating a theory before he finally finished Origin of Species.
1Rafael Harth
Fair enough; the more accurate response would have been that evolution might be an example, depending on how the theory was derived (which I don't know). Maybe it's not actually an example. The crux would be when exactly he got the idea; if the idea came first and the examples later, then it's still largely analogous (imo); if the examples were causally upstream of the core idea, then not so much.

Suppose you’ve trained a really clever AI model, and you’re planning to deploy it in an agent scaffold that allows it to run code or take other actions. You’re worried that this model is scheming, and you’re worried that it might only need to take a small number of actions to get to a dangerous and hard-to-reverse situation like exfiltrating its own weights. Problems like these that the AI can cause with small numbers of actions are particularly hard to address with standard safety techniques, because by default you are unlikely to notice the problem until the hard-to-reverse outcome has already occurred.

One of the main classes of techniques that you could apply in the hope of preventing these catastrophic actions is what we call untrusted monitoring (introduced...

One aspect of untrusted monitoring which this post neglects, but which seems very important to me, is that we'll be uncertain about which model should be untrusted and the extent to which these models are untrusted.

This suggests that ensembling over untrusted monitors could be very promising (and much of the benefits of ensembling can be obtained without preventing collusion). Further, I expect that various types of ensembling make collusion harder even if all relevant models in the ensemble are scheming.

Ensembling untrusted monitors increases variance som... (read more)

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)
Log In Reset Password
...or continue with

Summary: Comfort with really big and really small quantities is very useful for understanding the world and is perfectly doable with practice. Therefore, you should (1) try to familiarize yourself with the sizes of various things using e.g. spaced repetition and (2) comment some of your favorite numerical facts about the world that give a sense of scale so others can learn from them. I have a list of links at the bottom that I'd love for you to add to, in a similar spirit to The Best Textbooks on Every Subject.

 

One of the greatest lies that science fiction and popular science told me was that there were features of the universe that are incomprehensibly small or big. Sure, femtoseconds, angstroms, parsecs and electron-volts are foreign to...

I've been wanting to write a nice post for a few months, but should probably just write one sooner instead. This is a top-level post not because it's a long post, but because it's an important one.

Anyways. Cryonics is pretty much money-free now (ie. subsidized technically)—one of the most affordable ways to dispose of your body post-mortem.

Don't die because you didn't double-check whether the argument you came up against cryonics in 5 seconds checks out.

In the west coast in the USA, from Oregon Brain Preservation, as of around May 2024 I think:

Our research program is open to individuals in Washington, Oregon, and Northern California. This is the same brain preservation procedure, with the goal of future revival if this ever becomes feasible and humane. The difference is

...

Oh wow! Damn ☹️ Well, I'm super grateful for the time it was active. If you don't mind sending me a copy of your exchanges, I'd be interested.

Has some government or random billionaire sought out Petrov's heirs and made sure none of them have to work again if they don't want to? It seems like an obviously sensible thing to do from a game-theoretic point of view.