Often you can compare your own Fermi estimates with those of other people, and that’s sort of cool, but what’s way more interesting is when they share what variables and models they used to get to the estimate. This lets you actually update your model in a deeper way.

11Jameson Quinn
This is the second time I've seen this. Now it seems obvious. I remember liking it the first time, but also remember it being obvious. That second part of the memory is probably false. I think it's likely that this explained the idea so well that I now think it's obvious. In other words: very well done.
Customize
Sam Marks*140
0
I've noticed some common misconceptions that people have about OpenAI's recent paper on reward hacking and chain-of-thought (CoT) monitoring. I'll try to correct some of these below. To be clear, this isn't meant as criticism of the paper (which I quite enjoyed), just a correction of some misconceptions I've noticed among readers. The obfuscated reward hacking results were in a stylized setting, not a production one. Recall that in the paper there are two experimental settings: 1. Using CoT monitoring to detect reward hacking in a realistic production setting. 2. Training against a CoT monitor and, after enough steps of RL, observing high rates of obfuscated reward hacking, i.e. reward hacking that isn't disclosed in the model's CoT. Setting (1) is realistic: The agent is a frontier model, the environment is (I think) a production RL environment, and—most importantly—the CoT monitor is provided the agent's CoT, actions, and final response (i.e. the full transcript, which is the practical choice when monitoring RL transcripts). However, I think some readers don't realize that setting (2) is different in a number of ways: 1. The agent is not a frontier model. 2. The environment has been modified to have a very simple reward hack available: editing test cases (despite being told not to). 3. [Most important] The CoT monitor can only see the agent's CoT; it cannot see its actions or final output. Changes (2) and (3) are probably necessary to get the core obfuscated reward hacking results, since the "edit tests" reward hack is very easy to catch if you are given access to the actions/final output. (I.e. an action monitor would probably have had near-perfect recall in this setting.) Why does this matter? If your update from this experiment was about the probability of obfuscated reward hacking occurring in practice in realistic production settings, then you should probably roll back the update. Rather, I think this experiment should be viewed as a nice demo of t
Fabien RogerΩ2852-5
20
I ran quick experiments that make me think that it's somewhat hard for LLMs to learn radically new encodings in an unsupervised way, and thus that LLMs probably won't learn to speak new incomprehensible languages as a consequence of big r1-like RL in the next few years. The experiments I trained Llama 3-8B and some medium-size internal Anthropic models to speak using an encoding style that is very rare on the internet (e.g. map each letter to a random name, and join the names) with SFT on the encoded text and without providing translation pairs. I find that the resulting models: * Have relatively high next-token-prediction losses * Can't speak well (e.g. even if I trained them on [encoded question --> encoded answer], the decoded answers are mostly gibberish). This is not true if I use encodings that are frequently used on the internet (e.g. base64, random letter permutations, ...), and this is less true if I add translation pairs to the training mix. I think people overestimate how good LLMs are at speaking in codes because they usually just try encoding algorithms that are extremely common on the internet and that LLMs probably learned in part using translation pairs. These experiments were done at a relatively small scale (80k sequences of ~50 tokens each, for 40 epochs, for a total of 160M tokens), and I could imagine things being different for large scale SFT or SFT with better base models. But given that RL doesn't teach models that many new bits (you can distill r1 scratchpads using only a few MB of transcripts), I think this is still informative for large-scale RL. Context I ended up giving up on the broader projects as part of which I ran these experiments, but I think they are decently informative about the plausibility of LLMs learning to speak new incomprehensible languages. I did not carefully run any of the experiments above, and I welcome replications. Related observations from the literature I think there is also a somewhat common miscon
Raemon*112
1
Metastrategy = Cultivating good "luck surface area"? Metastrategy: being good at looking at an arbitrary situation/problem, and figure out what your goals are, and what strategies/plans/tactics to employ in pursuit of those goals. Luck Surface area: exposing yourself to a lot of situations where you are more likely to get valuable things in a not-very-predictable way. Being "good at cultivating luck surface area" means going to events/talking-to-people/consuming information that are more likely to give you random opportunities / new ways of thinking / new partners. At one of my metastrategy workshops, while I talked with a participant about what actions had been most valuable the previous year, many of the things were like "we published a blogpost, or went to an event, and then kinda randomly found people who helped us a bunch, i.e. gave us money or we ended up hiring them." This led me to utter the sentence "yeah, okay I grudgingly admit that 'increasing your luck surface area' is more important than being good at 'metastrategy'", and I improvised a session on "where did a lot of your good luck come from this year, and how could you capitalize more on that?" But, thinking later, I think maybe actually "being good at metastrategy" and "being good at managing luck surface area" are maybe basically the same thing? That is: If already know how to handle a given situation, you're basically using "strategy", not "metastrategy." If you don't already know, what you wanna do is strategically direct your thoughts in novel directions (maybe by doing crazy brainstorming, maybe by doing structured "think about the problem in a bunch of different ways that seem likely to help", maybe by taking a shower and letting your mind wander, maybe by talking to people who will likely have good advice about your problem. This is basically "exposing luck surface area" for your cognition. Thinking about it more and chatting with a friend: Managing Luck Surface Area seems like a sub
In my post on value systematization I used utilitarianism as a central example of value systematization. Value systematization is important because it's a process by which a small number of goals end up shaping a huge amount of behavior. But there's another different way in which this happens: core emotional motivations formed during childhood (e.g. fear of death) often drive a huge amount of our behavior, in ways that are hard for us to notice. Fear of death and utilitarianism are very different. The former is very visceral and deep-rooted; it typically influences our behavior via subtle channels that we don't even consciously notice (because we suppress a lot of our fears). The latter is very abstract and cerebral, and it typically influences our behavior via allowing us to explicitly reason about which strategies to adopt. But fear of death does seem like a kind of value systematization. Before we have a concept of death we experience a bunch of stuff which is scary for reasons we don't understand. Then we learn about death, and then it seems like we systematize a lot of that scariness into "it's bad because you might die". But it seems like this is happening way less consciously than systematization to become a utilitarian. So maybe we need to think about systematization happening separately in system 1 and system 2? Or maybe we should think about it as systematization happening repeatedly in "layers" over time, where earlier layers persist but are harder to access later on. I feel pretty confused about this. But for now my mental model of the mind is as two (partially overlapping) inverted pyramids, one bottoming out in a handful of visceral motivations like "fear of death" and "avoid pain" and "find love", and the other bottoming out in a handful of philosophical motivations like "be a good Christian" or "save the planet" or "make America great again" or "maximize utility". The second (system 2) pyramid is trying to systematize the parts of system 1 that
leogao60
0
made an estimate of the distribution of prices of the SPX in one year by looking at SPX options prices, smoothing the implied volatilities and using Breeden-Litzenberger. (not financial advice etc, just a fun side project)

Popular Comments

Recent Discussion

When physicists were figuring out quantum mechanics, one of the major constraints was that it had to reproduce classical mechanics in all of the situations where we already knew that classical mechanics works well - i.e. most of the macroscopic world. Likewise for special and general relativity - they had to reproduce Galilean relativity and Newtonian gravity, respectively, in the parameter ranges where those were known to work. Statistical mechanics had to reproduce the fluid theory of heat; Maxwell's equations had to agree with more specific equations governing static electricity, currents, magnetic fields and light under various conditions.

Even if the entire universe undergoes some kind of phase change tomorrow and the macroscopic physical laws change entirely, it would still be true that the old laws did work...

There's a bit of disconnect from statement "it all adds up to normality" and how it actually could look like. Suppose someone discovered new physical law, if you say "fjrjfjjddjjsje" out loud you can fly, somehow. Now you are reproducing all the old model's confirmed behaviour while flying. Is this normal?

AI for Epistemics is about helping to leverage AI for better truthseeking mechanisms — at the level of individual users, the whole of society, or in transparent ways within the AI systems themselves. Manifund & Elicit recently hosted a hackathon to explore new projects in the space, with about 40 participants, 9 projects judged, and 3 winners splitting a $10k prize pool. Read on to see what we built!

 

Resources

  • See the project showcase: https://moxsf.com/ai4e-hacks
  • Watch the recordings: project demos, opening speeches
  • See the outline of project ideas: link
    • Thanks to Owen Cotton-Barratt, Raymond Douglas, and Ben Goldhaber for preparing this!
  • Lukas Finnveden on “What's important in ‘AI for epistemics’?”: link
  • Automation of Wisdom and Philosophy essay contest: link

Why this hackathon?

From the opening speeches; lightly edited.

Andreas Stuhlmüller: Why I'm excited about AI for Epistemics

In...

Just asked Claude for thoughts on some technical mech interp paper I'm writing. The difference with and without the non-sycophantic prompt is dramatic (even with extended thinking):
Me: What do you think of this bullet point structure for this section based on those results?
Claude (normal): I think your proposed structure makes a lot of sense based on the figures and table. Here's how I would approach reorganizing this section: {expand my bullet points but sticks to my structure}
Claude (non-sycophantic): I like your proposed structure, but I think it misses... (read more)

Prerequisites: Graceful Degradation. Summary of that: Some skills require the entire skill to be correctly used together, and do not degrade well. Other skills still work if you only remember pieces of it, and do degrade well. 

Summary of this: The property of graceful degradation is especially relevant for skills which allow groups of people to coordinate with each other. Some things only work if everyone does it, other things work as long as at least one person does it.

I.  

Examples: 

  • American Sign Language is pretty great. It lets me talk while underwater or in a loud concert where I need earplugs. If nobody else knows ASL though, it's pretty useless for communication.
  • Calling out a play in football is pretty great. It lets the team maneuver the ball to where it
...
R S10

I agree with this although it makes me think about company culture 

There is huge emergent value to some of the.. let's call them "softer" communication approaches 

It becomes possible to get out of random suboptimal Nash equilibriums almost immediately 

People can give more to each other, and better receive feedback 

But I think the only way to do this is by having the type of people who already think in those terms and prefer them 

There's not a lot you can do to enforce it 

But it's still a thing, and in my opinion it's still a t... (read more)

PDF version. berkeleygenomics.org. Twitter thread. (Bluesky copy.)

Summary

The world will soon use human germline genomic engineering technology. The benefits will be enormous: Our children will be long-lived, will have strong and diverse capacities, and will be halfway to the end of all illness.

To quickly bring about this world and make it a good one, it has to be a world that is beneficial, or at least acceptable, to a great majority of people. What laws would make this world beneficial to most, and acceptable to approximately all? We'll have to chew on this question ongoingly.

Genomic Liberty is a proposal for one overarching principle, among others, to guide public policy and legislation around germline engineering. It asserts:

Parents have the right to freely choose the genomes of their children.

If upheld,...

2bhauth
No, you can't have that from SNP editing. I suppose you're basing this off stuff like GeneSmith's post and didn't follow the arguments in the comments? Maybe I'll write a proper rebuttal at some point; I just haven't felt like bothering so far.
2TsviBT
I'm basing this off of selection, not editing. I haven't looked into the genetics stuff very much, because the bottleneck is biotech, not polygenic scores. Would look forward to your rebuttal! I just hope you'll respond to the strongest arguments, not the weakest. In particular, if you want to argue against the potential effectiveness of selection methods, I think you'd want to either argue that PGSes aren't picking up causal variants at all (I mean, that there's a large amount of correlation that isn't causation); or that the causality would top out / have strongly diminishing returns on the trait. Selection methods would capture approximately all of the causal stuff that the PGS is picking up, even if it's not even due to SNPs but rather rarer SNVs. (However, this would not apply to population stratification or something; then I'd think you'd want to argue that this is much / most of what PGSes are picking up, and there'd be already-made counterarguments to this that you should respond to in order to be convincing.)
2bhauth
Well, let's talk in terms of your intuitions for a bit. Why do you expect that only a very small fraction of natural SNP differences are needed to get the extremes of natural results or beyond those? Why do you expect that effects would be linear?
TsviBT20

I'll repeat that I'm not very learned about genetics, so if you want to convince even me in particular, the best way is to respond to the strongest case, which I can't present. But ok:

First I'll say that an empirical set of facts I'd quite like to have for many traits (disease, mental disease, IQ, personality) would be validation of tails. E.g. if you look at the bottom 1% on a disease PRS, what's the probability of disease? Similarly for IQ.

or beyond those?

I rarely make claims about going much beyond natural results; generally I think it's pretty plau... (read more)

According to Nick Cammarata, who rubs shoulders with mystics and billionaires, arahants are as rare as billionaires.

This surprised me, because there are 2+[1] thoroughly-awakened people in my social circle. And that's just in meatspace. Online, I've interacted with a couple others. Plus I met someone with Stream Entry at Less Online last year. That brings the total to a minimum of 4, but it's probably at least 6+.

Meanwhile, I know 0 billionaires.

The explanation for this is obvious; billionaires cluster, as do mystics. If I were a billionaire, then it would be strange for me to not have rubbed shoulders with at least a 6+ other billionaires.

But there's a big difference between billionaires and arahants; it's easy to prove you're a billionaire; just throw a party on your private...

2lsusr
That makes sense. I was misunderstanding your list as "a list of meditation-related things that are difficult to define", and got confused, because it is easy to define what the Qualia Research Institute is.
4Richard_Kennaway
A quiz! (I am jokingly taking this in exactly the spirit you warned against.) No, I've never had anything like this. My attitude is more, shit happens, I deal with it, and move on. (Because what's the alternative? Not dealing with it. Which never works.) Does experiencing my "self" as including all that stuff count? I am guessing not. I have a strong sense of my own continuing presence. Sounds dangerous. It was certainly painful when I closed a car door on my thumbnail a few months ago. (The new thumbnail may have grown back in another few months.) Way beyond me. I seem to score a zero on this. I'm sure I've notched up some 100s of hours of meditation, but spread over a rather large number of years, and rarely a daily practice.

shit happens, I deal with it, and move on. (Because what's the alternative? Not dealing with it.

 

2lsusr
Hahaha! I'm not just talking about your thoughts and feelings. When I say "everything in your consciousness", I mean [what you perceive as] the Sun, other people, mountains in the distance, the dirt on your floor, etc. Not really, unless you plan to light yourself on fire to protest something. It's still unpleasant, and the reactive instinct is still there. I think I hit this stuff with fewer hours of meditation than is typical, and that most people require more hours on the cushion. Also, it depends on what kind of meditation you do. Not everything branded as "meditation" is equally effective at jailbreaking the Matrix. Whether you're doing it badly is illegible too.
unvibe

 

In my daily work as software consultant I'm often dealing with large pre-existing code bases. I use GitHub Copilot a lot. It's now basically indispensable, but I use it mostly for generating boilerplate code, or figuring out how to use a third-party library.
As the code gets more logically nested though, Copilot crumbles under the weight of complexity. It doesn't know how things should fit together in the project.

Other AI tools like Cursor or Devin, are pretty good at generating quickly working prototypes, but they are not great at dealing with large existing codebases, and they have a very low success rate for my kind of daily work.
You find yourself in an endless loop of prompt tweaking, and at that point, I'd rather write the code myself with...

If the code uses other code you've written, how do you ensure all dependencies fit within the context window? Or does this only work for essentially dependencyless code?

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)
Log In Reset Password
...or continue with

In my post on value systematization I used utilitarianism as a central example of value systematization.

Value systematization is important because it's a process by which a small number of goals end up shaping a huge amount of behavior. But there's another different way in which this happens: core emotional motivations formed during childhood (e.g. fear of death) often drive a huge amount of our behavior, in ways that are hard for us to notice.

Fear of death and utilitarianism are very different. The former is very visceral and deep-rooted; it typically inf... (read more)

TLDR: I sketch a language for describing (either Bayesian or infra-Bayesian) hypotheses about computations (i.e. computable[1] logical/analytic facts). This is achieved via a novel (AFAIK) branch of automata theory, which generalizes both ordinary and tree automata by recasting them in category-theoretic language. The resulting theory can be equipped with a intriguing self-referential feature: automata that "build" and run new automata in "runtime".

Epistemic status: this is an "extended shortform" i.e. the explanations are terse, the math is a sketch (in particular the proofs are not spelled out) and there might be errors (but probably not fatal errors). I made it a top-level post because it seemed worth reporting and was too long for a shortform.

Motivation

Finite-state automata are a convenient tool for constructing hypotheses classes that admit efficient learning....

More precisely, they are algebras over the free operad generated by the input alphabet of the tree automaton

Wouldn't this fail to preserve the arity of the input alphabet? i.e. you can have trees where a given symbol occurs multiple times, and with different amounts of children? That wouldn't be allowed from the perspective of the tree automaton right?

I recently left OpenAI to pursue independent research. I’m working on a number of different research directions, but the most fundamental is my pursuit of a scale-free theory of intelligent agency. In this post I give a rough sketch of how I’m thinking about that. I’m erring on the side of sharing half-formed ideas, so there may well be parts that don’t make sense yet. Nevertheless, I think this broad research direction is very promising.

This post has two sections. The first describes what I mean by a theory of intelligent agency, and some problems with existing (non-scale-free) attempts. The second outlines my current path towards formulating a scale-free theory of intelligent agency, which I’m calling coalitional agency. 

Theories of intelligent agency

By a “theory of intelligent agency” I mean a...

However, this strongly limits the space of possible aggregated agents. Imagine two EUMs, Alice and Bob, whose utilities are each linear in how much cake they have. Suppose they’re trying to form a new EUM whose utility function is a weighted average of their utility functions. Then they’d only have three options:

  1. Form an EUM which would give Alice all the cakes (because it weights Alice’s utility higher than Bob’s)
  2. Form an EUM which would give Bob all the cakes (because it weights Bob’s utility higher than Alice’s)
  3. Form an EUM which is totally indifferent abo
... (read more)
2plex
Nice! I think you might find my draft on Dynamics of Healthy Systems: Control vs Opening relevant to these explorations, feel free to skim as it's longer than ideal (hence unpublished, despite containing what feels like a general and important insight that applies to agency at many scales). I plan to write a cleaner one sometime, but for now it's claude-assisted writing up my ideas, so it's about 2-3x more wordy than it should be.