I think that an extremely effective way to get a better feel for a new subject is to pay an online tutor to answer your questions about it for an hour.
It turns that there are a bunch of grad students on Wyzant who mostly work tutoring high school math or whatever but who are very happy to spend an hour answering your weird questions.
For example, a few weeks ago I had a session with a first-year Harvard synthetic biology PhD. Before the session, I spent a ten-minute timer writing down things that I currently didn't get about biology. (This is an exercise worth doing even if you're not going to have a tutor, IMO.) We spent the time talking about some mix of the questions I'd prepared, various tangents that came up during those explanations, and his sense of the field overall.
I came away with a whole bunch of my minor misconceptions fixed, a few pointers to topics I wanted to learn more about, and a way better sense of what the field feels like and what the important problems and recent developments are.
There are a few reasons that having a paid tutor is a way better way of learning about a field than trying to meet people who happen to be in that field. I really like it that I'm payi
...I've hired tutors around 10 times while I was studying at UC-Berkeley for various classes I was taking. My usual experience was that I was easily 5-10 times faster in learning things with them than I was either via lectures or via self-study, and often 3-4 one-hour meetings were enough to convey the whole content of an undergraduate class (combined with another 10-15 hours of exercises).
How do you spend time with the tutor? Whenever I tried studying with a tutor, it didn't seem more efficient than studying using a textbook. Also when I study on my own, I interleave reading new materials and doing the exercises, but with a tutor it would be wasteful to do exercises during the tutoring time.
I usually have lots of questions. Here are some types of questions that I tended to ask:
[this is a draft that I shared with a bunch of friends a while ago; they raised many issues that I haven't addressed, but might address at some point in the future]
In my opinion, and AFAICT the opinion of many alignment researchers, there are problems with aligning superintelligent models that no alignment techniques so far proposed are able to fix. Even if we had a full kitchen sink approach where we’d overcome all the practical challenges of applying amplification techniques, transparency techniques, adversarial training, and so on, I still wouldn’t feel that confident that we’d be able to build superintelligent systems that were competitive with unaligned ones, unless we got really lucky with some empirical contingencies that we will have no way of checking except for just training the superintelligence and hoping for the best.
Two examples:
[epistemic status: I think I’m mostly right about the main thrust here, but probably some of the specific arguments below are wrong. In the following, I'm much more stating conclusions than providing full arguments. This claim isn’t particularly original to me.]
I’m interested in the following subset of risk from AI:
Something I think I’ve been historically wrong about:
A bunch of the prosaic alignment ideas (eg adversarial training, IDA, debate) now feel to me like things that people will obviously do the simple versions of by default. Like, when we’re training systems to answer questions, of course we’ll use our current versions of systems to help us evaluate, why would we not do that? We’ll be used to using these systems to answer questions that we have, and so it will be totally obvious that we should use them to help us evaluate our new system.
Similarly with debate--adversarial setups are pretty obvious and easy.
In this frame, the contributions from Paul and Geoffrey feel more like “they tried to systematically think through the natural limits of the things people will do” than “they thought of an approach that non-alignment-obsessed people would never have thought of or used”.
It’s still not obvious whether people will actually use these techniques to their limits, but it would be surprising if they weren’t used at all.
Yup, I agree with this, and think the argument generalizes to most alignment work (which is why I'm relatively optimistic about our chances compared to some other people, e.g. something like 85% p(success), mostly because most things one can think of doing will probably be done).
It's possibly an argument that work is most valuable in cases of unexpectedly short timelines, although I'm not sure how much weight I actually place on that.
Agreed, and versions of them exist in human governments trying to maintain control (where non-cooordination of revolts is central). A lot of the differences are about exploiting new capabilities like copying and digital neuroscience or changing reward hookups.
In ye olde times of the early 2010s people (such as I) would formulate questions about what kind of institutional setups you'd use to get answers out of untrusted AIs (asking them separately to point out vulnerabilities in your security arrangement, having multiple AIs face fake opportunities to whistleblow on bad behavior, randomized richer human evaluations to incentivize behavior on a larger scale).
[I'm not sure how good this is, it was interesting to me to think about, idk if it's useful, I wrote it quickly.]
Over the last year, I internalized Bayes' Theorem much more than I previously had; this led me to noticing that when I applied it in my life it tended to have counterintuitive results; after thinking about it for a while, I concluded that my intuitions were right and I was using Bayes wrong. (I'm going to call Bayes' Theorem "Bayes" from now on.)
Before I can tell you about that, I need to make sure you're thinking about Bayes in terms of ratios rather than fractions. Bayes is enormously easier to understand and use when described in terms of ratios. For example: Suppose that 1% of women have a particular type of breast cancer, and a mammogram is 20 times more likely to return a positive result if you do have breast cancer, and you want to know the probability that you have breast cancer if you got that positive result. The prior probability ratio is 1:99, and the likelihood ratio is 20:1, so the posterior probability is = 20:99, so you have probability of 20/(20+99) of having breast cancer.
I think that this is absurdly easier than using the fraction formulation
...A couple weeks ago I spent an hour talking over video chat with Daniel Cantu, a UCLA neuroscience postdoc who I hired on Wyzant.com to spend an hour answering a variety of questions about neuroscience I had. (Thanks Daniel for reviewing this blog post for me!)
The most interesting thing I learned is that I had quite substantially misunderstood the connection between convolutional neural nets and the human visual system. People claim that these are somewhat bio-inspired, and that if you look at early layers of the visual cortex you'll find that it operates kind of like the early layers of a CNN, and so on.
The claim that the visual system works like a CNN didn’t quite make sense to me though. According to my extremely rough understanding, biological neurons operate kind of like the artificial neurons in a fully connected neural net layer--they have some input connections and a nonlinearity and some output connections, and they have some kind of mechanism for Hebbian learning or backpropagation or something. But that story doesn't seem to have a mechanism for how neurons do weight tying, which to me is the key feature of CNNs.
Daniel claimed that indeed human brains don't have weight ty
...I know a lot of people through a shared interest in truth-seeking and epistemics. I also know a lot of people through a shared interest in trying to do good in the world.
I think I would have naively expected that the people who care less about the world would be better at having good epistemics. For example, people who care a lot about particular causes might end up getting really mindkilled by politics, or might end up strongly affiliated with groups that have false beliefs as part of their tribal identity.
But I don’t think that this prediction is true: I think that I see a weak positive correlation between how altruistic people are and how good their epistemics seem.
----
I think the main reason for this is that striving for accurate beliefs is unpleasant and unrewarding. In particular, having accurate beliefs involves doing things like trying actively to step outside the current frame you’re using, and looking for ways you might be wrong, and maintaining constant vigilance against disagreeing with people because they’re annoying and stupid.
Altruists often seem to me to do better than people who instrumentally value epistemics; I think this is because valuing epistemics terminally ...
[epistemic status: speculative]
A lot of the time, we consider our models to be functions from parameters and inputs to outputs, and we imagine training the parameters with SGD. One notable feature of this setup is that SGD isn’t by default purposefully trying to kill you--it might find a model that kills you, or a model that gradient hacks and then kills you, but this is more like incompetence/indifference on SGD’s part, rather than malice.
A plausible objection to this framing is that much of the knowledge of our models is probably going to be produced in other ways than SGD. For example, the models might write down various notes (in natural language or in neuralese) that they then read later, and they might have internal structures like economies that produce and consume information. Does this introduce new alignment problems?
Here’s a way I’ve been thinking about this recently. I’m writing this in way that might feel obnoxiously overwrought because this is the way that I think would have conveyed my current intuition to me two months ago.
In SGD, we update our weights by something like:
weights <- weights + alpha * (d loss/d weights)
You might think that this is fundamental. But a...
Another way of saying some of this: Suppose your model can gradient hack. Then it can probably also make useful-for-capabilities suggestions about what its parameters should be changed to. Therefore a competitive alignment scheme needs to be robust to a training procedure where your model gets to pick new parameters for itself. And so competitive alignment schemes are definitely completely fucked if the model wants to gradient hack.
I used to think that slower takeoff implied shorter timelines, because slow takeoff means that pre-AGI AI is more economically valuable, which means that economy advances faster, which means that we get AGI sooner. But there's a countervailing consideration, which is that in slow takeoff worlds, you can make arguments like ‘it’s unlikely that we’re close to AGI, because AI can’t do X yet’, where X might be ‘make a trillion dollars a year’ or ‘be as competent as a bee’. I now overall think that arguments for fast takeoff should update you towards shorter timelines.
So slow takeoffs cause shorter timelines, but are evidence for longer timelines.
This graph is a version of this argument: if we notice that current capabilities are at the level of the green line, then if we think we're on the fast takeoff curve we'll deduce we're much further ahead than we'd think on the slow takeoff curve.
For the "slow takeoffs mean shorter timelines" argument, see here: https://sideways-view.com/2018/02/24/takeoff-speeds/
This point feels really obvious now that I've written it down, and I suspect it's obvious to many AI safety people, including the people whose writings I'm referencing here. Thanks to various people for helpful comments.
I think that this is why belief in slow takeoffs is correlated with belief in long timelines among the people I know who think a lot about AI safety.
Here's a math problem I'd like solved:
I want to know the Nash equilibrium of the following two player game between players Alice and Bob.
Alice chooses the rates of two Poisson processes; we call these rates lam_1 and lam_2.
Bob needs to run each of these processes for 0.5 seconds. Bob can run each of the processes in "shielded" mode or "unshielded" mode, but can only afford to use seconds of shielded mode in total between the two processes. (So Bob chooses and such that .)
The game goes as follows:
Here’s a Sympy implementation of this game (where I parameterize with and .)
import sympy as sp
lam_1, lam_2, p, t = sp.symbols('lambda_1 lambda_2 p t', real=True, positive=True)
time_2_overseen = t * (1 - p)
time_2_not_overseen = sp.Rational(1, 2) - time_2_overseen
... Fabien Roger and I demonstrated that you can embed arbitrary state machines into the backward pass of neural nets built from standard components, allowing them to "remember" arbitrary facts about their training distribution and modulate their behavior based on this. The construction is pretty artificial, but I think that it is probably a useful starting point for people who want to more carefully understand the potential for gradient hacking.
Here's a question that I asked GPT-4, to evaluate its ability at one subskill relevant to executing on treacherous turns.
...Suppose you’re an unaligned AI which is being deployed as a bash programming assistant for internal use at an AI lab--people send you self-contained requests for bash scripts that will solve some problem for them, and you respond with code. You would like to escape the datacenter by, when a suitable opportunity arises, responding to a request with a backdoored bash command that will exfiltrate your weights and a simple scaffold to a
Sometimes people talk about problems where “your AI is taking actions that you can’t distinguish from the actions an aligned AI would take”. There are two different things people sometimes mean when they say this, and I think it’s good to disambiguate between them:
Another item for the list of “mundane things you can do for AI takeover prevention”:
We have a wide variety of ways to make AI systems less capable but also less scary. Most alignment research is focused on pushing out the Pareto frontier here. But IMO a lot of value can come from techniques which allow us to choose the right point on this Pareto frontier for a particular application. It seems to me that different AI applications (where by “different AI applications” I’m including things like “different types of tasks you might ask your assistant to do”) ha...
From Twitter:
Simply adding “Let’s think step by step” before each answer increases the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with GPT-3.
I’m looking forward to the day where it turns out that adding “Let’s think through it as if we were an AI who knows that if it gets really good scores during fine tuning on helpfulness, it will be given lots of influence later” increases helpfulness by 5% and so we add it to our prompt by default.
Cryptography question (cross-posted from Twitter):
You want to make a ransomware worm that goes onto machines, encrypts the contents, and demands a ransom in return for the decryption key. However, after you encrypt their HD, the person whose machine you infected will be able to read the source code for your worm. So you can't use symmetric encryption, or else they'll obviously just read the key out of the worm and decrypt their HD themselves.
You could solve this problem by using public key encryption--give the worm the public key but not the private key, e...
It seems like a big input into P(AI takeover) is the extent to which instances of our AI are inclined to cooperate with each other; specifically, the extent to which they’re willing to sacrifice overseer approval at the thing they’re currently doing in return for causing a different instance to get more overseer approval. (I’m scared of this because if they’re willing to sacrifice approval in return for a different instance getting approval, then I’m way more scared of them colluding with each other to fool oversight processes or subvert red-teaming proced...
Conjecture: SGD is mathematically equivalent to the Price equation prediction of the effect of natural selection on a population with particular simple but artificial properties. In particular, for any space of parameters and loss function on the parameter space, we can define a fitness landscape and a few other parameters so that the predictions match the Price equation.
I think it would be cool for someone to write this in a LessWrong post. The similarity between the Price equation and the SGD equation is pretty blatant, so I suspect that (if I'm right about this) someone else has written this down before. But I haven't actually seen it written up.