From Rethink Priorities:
- We used Monte Carlo simulations to estimate, for various sentience models and across eighteen organisms, the distribution of plausible probabilities of sentience.
- We used a similar simulation procedure to estimate the distribution of welfare ranges for eleven of these eighteen organisms, taking into account uncertainty in model choice, the presence of proxies relevant to welfare capacity, and the organisms’ probabilities of sentience (equating this probability with the probability of moral patienthood)
Now with the disclaimer that I do think that RP are doing good and important work and are one of the few organizations seriously thinking about animal welfare priorities...
Their epistemics led them to run a Monte Carlo simulation to determine whether organisms are capable of suffering (and if so, how much), arrive at a value of 5 shrimp = 1 human, and then not bat an eye at this number.
Neither a physicalist nor a functionalist theory of consciousness can reasonably justify a number like this. Shrimp have 5 orders of magnitude fewer neurons than humans, so whether suffering is the result of a physical process or an information processing one, this implies that shrimp neurons do 4 orders of magnitude more of this process per second than human neurons.
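To sanity-check that arithmetic with the round figures above (treating the welfare ratio as exactly 1/5 and the neuron gap as exactly five orders of magnitude):

$$\frac{\text{shrimp welfare per neuron}}{\text{human welfare per neuron}} \approx \frac{1/5}{10^{-5}} = \frac{10^{5}}{5} = 2\times 10^{4},$$

i.e. roughly four orders of magnitude more welfare-relevant processing per neuron.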
epistemic status: Disagreeing on object-level topic, not the topic of EA epistemics.
I disagree; functionalism especially can justify a number like this. Here's an example of reasoning along these lines:
Under that view, shrimp can absolutely suffer in the same range as humans, and the amount of suffering is dependent on crossing some thresh...
Are there any high p(doom) orgs who are focused on the following:
Seems like this is a good way for people to deploy technical talent in a way which is tractable. There are a lot of people who are smart but not alignment-solving levels of smart who are currently not really able to help.
I'd say that work like our Alignment Faking in Large Language Models paper (and the model organisms/alignment stress-testing field more generally) is pretty similar to this (including the "present this clearly to policymakers" part).
A few issues:
My impression is that the current Real Actual Alignment Plan For Real This Time amongst medium p(Doom) people looks something like this:
(Ignoring the possibility of a pivotal act to shut down AI research. Most people I talk to don't think this is reasonable.)
I'll ignore the practicality of 3. What do people expect 4 to look like? What does an AI assisted value alignment solution look like?
My rough guess of what it could be, i.e. the guess with the highest p(solution is this | AI gives us a real alignment solution), is something like the following. This tries to straddle the line between the helper AI being obviously powerful enough to kill us and obviously too dumb to solve alignment:
Too Early does not preclude Too Late
Thoughts on efforts to shift public (or elite, or political) opinion on AI doom.
Currently, it seems like we're in a state of being Too Early. AI is not yet scary enough to overcome peoples' biases against AI doom being real. The arguments are too abstract and the conclusions too unpleasant.
Currently, it seems like we're in a state of being Too Late. The incumbent players are already massively powerful and capable of driving opinion through power, politics, and money. Their products are already too useful and ubiquitous to be hated.
Unfortunately, these can both be true at the same time! This means that there will be no "good" time to play our cards. Superintelligence (2014) was Too Early but not Too Late. There may be opportunities which are Too Late but not Too Early, but (tautologically) these have not yet arrived. As it is, current efforts must fight on both fronts.
So Sonnet 3.6 can almost certainly speed up some quite obscure areas of biotech research. Over the past hour I've got it to:
Perhaps more importantly, it required almost no mental effort on my part to do this. Barely more than scrolling twitter or watching youtube videos. Actually solving the problems would have had to wait until tomorrow.
I will update in 3 months as to whether Sonnet's idea actually worked.
(in case anyone was wondering, it's not anything relating to protein design lol: Sonnet came up with a high-level strategy for approaching the problem)
The latest recruitment ad from Aidan McLaughlin says a lot about OpenAI's internal views on model training:
My interpretation of OpenAI's worldview, as implied by this, is:
None of this dramatically conflicts with what I already thought OpenAI believed, but it's interesting to get another angle on it.
It's quite possible that 1 is predicated on technical alignment work being done in other parts of the company (though their superalignment team no longer exists) and it's just not seen as the purview of the evals team. If so, it's still very optimistic. If there isn't such a team, then it's suicidally optimistic.
Fo...
Spoilers (I guess?) for HPMOR
HPMOR presents a protagonist who has a brain which is 90% that of a merely very smart child, but which is 10% filled with cached thought patterns taken directly from a smarter, more experienced adult. Part of the internal tension of Harry is between the un-integrated Dark Side thoughts and the rest of his brain.
Ironic, then, that the effect of reading HPMOR---and indeed a lot of Yudkowsky's work---was to imprint a bunch of un-integrated alien thought patterns onto my existing merely very smart brain. A lot of my development over the past few years has just been trying to integrate these things properly with the rest of my mind.
I've been a bit confused about "steering" as a concept. It seems kinda dual to learning, but why? It seems like things which are good at learning are very close to things which are good at steering, but they don't always end up steering. It also seems like steering requires learning. What's up here?
I think steering is basically learning, backwards, and maybe flipped sideways. In learning, you build up mutual information between yourself and the world; in steering, you spend that mutual information. You can have learning without ...
The hypothetical ammonia-reduction-in-shrimp-farm intervention has been touted as 1-2 OOMs more effective than shrimp stunning.
I think this is probably an underestimate, because I think that the estimates of shrimp suffering during death are probably too high.
(While I'm very critical of all of RP's welfare range estimates, including shrimp, that's not my point here. This argument doesn't rely on any arguments about shrimp welfare ranges overall. I do compare humans and shrimp, but IIUC this sort of comparison is the thing you multiply b...
As awful as the amount of fraud (and its lesser cousins) in science is for a scientist, it must be so much worse for a layperson. For example, this is a paper I found today suggesting that cleaner wrasse, a type of finger-sized fish, can not only pass the mirror test but also remember their own face and later respond the same way to a photograph of themselves as to a mirror.
https://www.pnas.org/doi/10.1073/pnas.2208420120
Ok, but it was published in PNAS. As a researcher I happen to know that PNAS allows for special-track submissions from memb...
https://threadreaderapp.com/thread/1925593359374328272.html
Reading between the lines here, Opus 4 was RLed by repeated iteration and testing. Seems like they had to hit it fairly hard (for Anthropic) with the "Identify specific bad behaviors and stop them" technique.
Relatedly: Opus 4 doesn't seem to have the "good vibes" that Opus 3 had.
Furthermore, this (to me) indicates that Anthropic's techniques for model "alignment" are getting less elegant and sophisticated over time, since the models are getting smarter---and thus harder to "align"---faster than Ant...
There's a court at my university accommodation that people who aren't Fellows of the college aren't allowed on; it's a pretty medium-sized square of mown grass. One of my friends said she was "morally opposed" to this (on biodiversity grounds: if the space wasn't being used for people, it should be used for nature).
And I couldn't help but think, how tiring it would be to have a moral-feeling-detector this strong. How could one possibly cope with hearing about burglaries, or North Korea, or astronomical waste?
I've been aware of scope insensitivity for a long time now, but this really put things in perspective in a visceral way for me.
Seems like if you're working with neural networks there's not a simple map from an efficient (in terms of program size, working memory, and speed) optimizer which maximizes X to an equivalent optimizer which maximizes -X. If we consider that an efficient optimizer does something like tree search, then it would be easy to flip the sign of the node-evaluating "prune" module. But the "babble" module is likely to select promising actions based on a big bag of heuristics which aren't easily flipped. Moreover, flipping a heuristic which upweights a small subset ...
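Here's a toy sketch of that asymmetry (the module names and the toy objective are invented for illustration): negating the "prune" evaluator is one line, but the "babble" proposer's heuristics stay tuned to the original objective, so the sign-flipped search fails.

```python
# Toy illustration of the prune/babble asymmetry.
import random

random.seed(0)

def evaluate(state):
    """'Prune' module: scores a state. Toy objective X = -|state - 42|."""
    return -abs(state - 42)

def babble(state, n=4):
    """'Babble' module: proposes successors using heuristics tuned for X,
    i.e. it mostly suggests moves toward 42."""
    step = 1 if state < 42 else -1
    return [state + step * random.choice([1, 2, 3]) for _ in range(n)]

def search(state, value_fn, depth=30):
    for _ in range(depth):
        state = max(babble(state), key=value_fn)
    return state

print(search(0, evaluate))                 # maximizes X: ends near 42
print(search(0, lambda s: -evaluate(s)))   # sign-flipped evaluator, but babble
                                           # still steers proposals toward 42,
                                           # so this fails to maximize -X
```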
I have no evidence for this but I have a vibe that if you build a proper mathematical model of agency/co-agency, then prediction and steering will end up being dual to one another.
My intuition why:
A strong agent can easily steer a lot of different co-agents; those different co-agents will be steered towards the same goals of the agent.
A strong co-agent is easily predictable by a lot of different agents; those different agents will all converge on a common map of the co-agent.
Also, category theory tells us that there is normally only one kind of thing, but ...
Logical inductors consider belief-states as prices over logical sentences in some language, with the belief-states decided by different computable "traders", and also some decision process which continually churns out proofs of logical statements in that language. This is a bit unsatisfying, since it contains several different kinds of things.
What if, instead of buying shares in logical sentences, the traders bought shares in each other. Then we only need one kind of thing.
Let's make this a bit more precise:
Thinking back to the various rationalist attempts to make a vaccine (https://www.lesswrong.com/posts/niQ3heWwF6SydhS7R/making-vaccine), for bird-flu-related reasons. Since then, we've seen mRNA vaccines arise as a new vaccination method. mRNA vaccines have been used intranasally for COVID with success in hamsters. If one can order mRNA for a flu protein, it would only take mixing it with some sort of delivery mechanism (such as Lipofectamine, which is commercially available) and snorting it to get what could actually be a pretty good vaccine. Has RaDVaC or similar looked at this?
Seems like there's a potential solution to ELK-like problems: force the information to move from the AI's ontology to (its model of) a human's ontology, and then force it to move back again.
This gets around "basic" deception since we can always compare the AI's ontology before and after the translation.
The question is how do we force the knowledge to go through the (modeled) human's ontology, and how do we know the forward and backward translators aren't behaving badly in some way.
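Here's a minimal sketch of what "forcing the round trip" might look like, assuming we have latent spaces for the two ontologies and train the translators with a cycle-consistency penalty. The module names, dimensions, and data are all placeholders, not a claim about how ELK should actually be solved.

```python
import torch
import torch.nn as nn

d_ai, d_human = 256, 64                     # assumed latent sizes

to_human = nn.Linear(d_ai, d_human)         # forward translator: AI ontology -> human ontology
to_ai = nn.Linear(d_human, d_ai)            # backward translator: human ontology -> AI ontology
human_model = nn.Linear(d_human, 1)         # stand-in for the AI's model of a human predictor

opt = torch.optim.Adam([*to_human.parameters(), *to_ai.parameters()], lr=1e-3)

z_ai = torch.randn(32, d_ai)                # batch of AI-ontology states (placeholder data)
target = torch.randn(32, 1)                 # what the human-legible summary should predict

z_h = to_human(z_ai)
z_back = to_ai(z_h)

# Force the knowledge through the human ontology: the human-side representation
# must support the prediction, and the round trip must reconstruct the AI state,
# which is what lets us compare the ontology before and after translation.
loss = ((human_model(z_h) - target) ** 2).mean() + ((z_back - z_ai) ** 2).mean()
loss.backward()
opt.step()
```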
Rather than using Bayesian reasoning to estimate P(A|B=b), it seems like most people use something like the following heuristic: favour whichever A leaves the least uncertainty about the details once you assume both A and b.
This is how you get "Saint Austacious could levitate, therefore God", since given [levitating saint] AND [God exists] there is very little uncertainty over what happened. Whereas given [levitating saint] AND [no God] there's a lot still left to wonder about regarding who made up the story at what point.
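A toy version of the Bayesian calculation the heuristic skips, with made-up numbers:

```python
# Contrast P(A|B) with the "how little is left to explain" heuristic.
p_god = 0.5                                   # prior P(God exists) -- illustrative only
p_story_given_god = 0.01                      # P(a levitation story arises | God)
p_story_given_no_god = 0.05                   # P(a levitation story arises | no God):
                                              # made-up stories are common, so this
                                              # need not be small even without God

p_story = p_god * p_story_given_god + (1 - p_god) * p_story_given_no_god
p_god_given_story = p_god * p_story_given_god / p_story
print(f"P(God | levitating-saint story) = {p_god_given_story:.2f}")  # ~0.17 with these numbers
```

The heuristic only looks at how neatly the first branch explains the story; Bayes also weighs the prior and how easily such stories arise without God.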
Getting rid of guilt and shame as motivators of people is definitely admirable, but still leaves a moral/social question. Goodness or Badness of a person isn't just an internal concept for people to judge themselves by, it's also a handle for social reward or punishment to be doled out.
I wouldn't want to be friends with Saddam Hussein, or even a deadbeat parent who neglects the things they "should" do for their family. This also seems to be true regardless of whether my social punishment or reward has the ability to change these people's behaviour. B...
Alright so we have:
- Bayesian Influence Functions allow us to find a training data:output loss correspondence
- Maybe the eigenvalues of the eNTK (very similar to the influence function) correspond to features in the data
- Maybe the features in the dataset can be found with an SAE
Therefore (will test this later today) maybe we can use SAE features to predict the influence function.
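A rough sketch of how one might test that, with a tiny random MLP standing in for the model and a random dictionary standing in for a trained SAE. None of this is the real Bayesian influence-function machinery, just the shape of the comparison:

```python
import torch

torch.manual_seed(0)

d_in, d_hidden, n_feat, n_examples = 16, 32, 64, 40
model = torch.nn.Sequential(torch.nn.Linear(d_in, d_hidden), torch.nn.ReLU(),
                            torch.nn.Linear(d_hidden, 1))
decoder = torch.nn.functional.normalize(torch.randn(n_feat, d_in), dim=-1)  # stand-in SAE dictionary

xs = torch.randn(n_examples, d_in)

def loss_grad(x):
    """Flattened gradient of a per-example loss w.r.t. model parameters."""
    model.zero_grad()
    loss = model(x).pow(2).sum()
    loss.backward()
    return torch.cat([p.grad.flatten().clone() for p in model.parameters()])

grads = torch.stack([loss_grad(x) for x in xs])
entk = grads @ grads.T                     # empirical NTK Gram matrix (influence-like similarity)

feats = torch.relu(xs @ decoder.T)         # crude "SAE" feature activations
feat_sim = feats @ feats.T                 # feature-overlap similarity

# Correlate the off-diagonal entries of the two similarity matrices.
mask = ~torch.eye(n_examples, dtype=torch.bool)
a, b = entk[mask], feat_sim[mask]
corr = torch.corrcoef(torch.stack([a, b]))[0, 1]
print(f"correlation between eNTK and SAE-feature similarity: {corr.item():.3f}")
```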
An early draft of a paper I'm writing went like this:
In the absence of sufficient sanity, it is highly likely that at least one AI developer will deploy an untrusted model: the developers do not know whether the model will take strategic, harmful actions if deployed. In the presence of a smaller amount of sanity, they might deploy it within a control protocol which attempts to prevent it from causing harm.
I had to edit it slightly. But I kept the spirit.
There's lots of discourse around at the moment about
I present a synthesis:
If you disagree with either of these, you might not want to halt now:
The constant hazard rate model probably predicts exponentially growing training-time inference compute (i.e. the inference done during guess-and-check RL) for agentic RL with a given model, because as the hazard rate decreases exponentially, we'll need to sample exponentially more tokens to see an error, and we need to see an error to get any signal.
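To spell out the middle step: with a constant per-token hazard rate $h$, the number of tokens until the first error is geometric, so

$$\mathbb{E}[\text{tokens until first error}] = \sum_{t \ge 1} t\,h\,(1-h)^{t-1} = \frac{1}{h},$$

and if $h$ falls exponentially across model generations, the tokens sampled per observed error (and hence per unit of training signal) grow exponentially.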
Hypothesis: one type of valenced experience---specifically valenced experience as opposed to conscious experience in general, about which I make no claims here---is likely to only exist in organisms with the capability for planning. We can analogize with deep reinforcement learning: it seems like humans have a rapid action-taking System 1, which is kind of like Q-learning in that it just selects actions, and a slower planning-based System 2, which is more like value learning. There's no reason to assign valence to a particular mental state if you're not able to imagine your own future mental states. There is of course moment-to-moment reward-like information coming in, but that seems to be a distinct thing to me.
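A toy contrast between the two systems, in deep-RL terms (the environment, states, and numbers are invented purely for illustration):

```python
# "System 1" picks actions from a learned Q-table; "System 2" imagines
# future states with a model and evaluates those states.
Q = {("hungry", "eat"): 1.0, ("hungry", "wait"): -0.2}              # model-free state-action values

def system1(state, actions):
    return max(actions, key=lambda a: Q[(state, a)])                # just selects an action

model = {("hungry", "eat"): "sated", ("hungry", "wait"): "hungry"}  # imagined transitions
V = {"sated": 1.0, "hungry": -0.2}                                  # values over *states*

def system2(state, actions):
    return max(actions, key=lambda a: V[model[(state, a)]])         # evaluates imagined future states

print(system1("hungry", ["eat", "wait"]), system2("hungry", ["eat", "wait"]))
```

Only the second system ever represents its own future states, which is the thing valence would need to attach to on this hypothesis.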
Heuristic explanation for why MoE gets better at higher model size:
The input/output size of a feedforward layer is equal to the model_width, but the total size of its weights grows as model_width squared. Superposition helps explain how a model component can make the most use of its input/output space (and presumably its parameters) using sparse overcomplete features, but in the limit, the amount of information accessed by the feedforward call scales with the number of active parameters. Therefore at some point, more active parameters won't scale so well, since you're "accessing" too much "memory" in the form of weights and overwhelming your input/output channels.
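Rough numbers for that intuition (the width, expansion factor, and expert counts below are illustrative, not any particular model's configuration):

```python
d = 8192                         # model width = FFN input/output size per token
ffn_mult = 4                     # standard FFN expansion factor

dense_active = 2 * d * (ffn_mult * d)        # up-proj + down-proj params touched per token
print(dense_active / d)                      # "weights accessed per I/O channel" = 8 * d,
                                             # grows linearly as the model gets wider

n_experts, top_k, d_expert = 64, 2, ffn_mult * d // 8
moe_total = n_experts * 2 * d * d_expert     # total FFN params keep growing...
moe_active = top_k * 2 * d * d_expert        # ...while params touched per token stay bounded
print(moe_total / dense_active, moe_active / dense_active)   # 8.0 and 0.25 with these numbers
```

With these numbers, the MoE has 8x the dense layer's total FFN parameters but touches only a quarter as many per token, so the per-token I/O channel isn't overwhelmed as the parameter count grows.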
If we approximate an MLP layer with a bilinear layer, then the effect of residual stream features on the MLP output can be expressed as a second order polynomial over the feature coefficients $f_i$. This will contain, for each feature, an $f_i^2 v_i+ f_i w_i$ term, which is "baked into" the residual stream after the MLP acts. Just looking at the linear term, this could be the source of Anthropic's observations of features growing, shrinking, and rotating in their original crosscoder paper. https://transformer-circuits.pub/2024/crosscoders/index.html
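Spelling out the algebra (my notation, not the paper's): write $d_i$ for the feature directions and approximate the MLP as $\mathrm{MLP}(x) \approx B(x,x) + Lx$ with $B$ bilinear. Then for $x = \sum_i f_i d_i$,

$$\mathrm{MLP}(x) \approx \sum_i \left( f_i^2\, B(d_i, d_i) + f_i\, L d_i \right) + \sum_{i \neq j} f_i f_j\, B(d_i, d_j),$$

so $v_i := B(d_i, d_i)$ and $w_i := L d_i$. Since $w_i$ need not be parallel to $d_i$, the linear term alone can grow, shrink, or rotate a feature's contribution to the residual stream.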
I think you should pay in Counterfactual Mugging, and this is one of the newcomblike problem classes that is most common in real life.
Example: you find a wallet on the ground. You can, from least to most prosocial:
Let's ignore the first option (suppose we're not THAT evil). The universe has randomly selected you today to be in the position where your only options are to spend some resources to no personal gain, or not. In a parallel universe, perhaps...
The UK has just switched their available rapid Covid tests from a moderately unpleasant one to an almost unbearable one. Lots of places require them for entry. I think the cost/benefit makes sense even with the new kind, but I'm becoming concerned we'll eventually reach the "imagine a society where everyone hits themselves on the head every day with a baseball bat" situation if cases approach zero.
Just realized I'm probably feeling much worse than I ought to on days when I fast because I've not been taking sodium. I really should have checked this sooner. If you're planning to do long (I do a day, which definitely feels long) fasts, take sodium!