All of A Ray's Comments + Replies

I like pointing out this confusion.  Here's a grab-bag of some of the things I use it for, to try to pull them apart:

  • actors/institutions far away from the compute frontier produce breakthroughs in AI/AGI tech (juxtaposing "only the top 100 labs" vs "a couple hackers in a garage")
  • that once a sufficient AI/AGI capability is reached, it will be quickly optimized to use much less compute
  • amount of "optimization pressure" (in terms of research effort) pursuing AI/AGI tech, and the likelihood that they missed low-hanging fruit
  • how far AI/AGI research/products
... (read more)

Comparing AI Safety-Capabilities Dilemmas to Jervis' Cooperation Under the Security Dilemma

I've been skimming some things about the Security Dilemma (specifically Offense-Defense Theory) while looking for analogies for strategic dilemmas in the AI landscape.

I want to describe a simple comparison here, lightly held (and only lightly studied):

  • "AI Capabilities" -- roughly, the ability to use AI systems to take (strategically) powerful actions -- as "Offense"
  • "AI Safety" -- roughly, that AI systems under control and use do not present a catastrophic/existential
... (read more)

I largely agree with the above, but I'm commenting with my own version.

What I think companies with AI services should do:

Can be done in under a week:

  1. Have a monitored communication channel for people, primarily security researchers, to responsibly disclose potential issues ("Potential Flaw")
    1. Creating an email (ml-disclosures@) which forwards to an appropriate team
    2. Submissions are promptly responded to with a positive receipt ("Vendor Receipt")
  2. Have clear guidance (even just a blog post or similar) about what constitutes an issue worth reporting ("Vulnerability")
    1. Ev
... (read more)

Weakly positive on this one overall.  I like Coase's theory of the firm, and like making analogies with it to other things.  I don't think this application quite worked for me, and I'm trying to write up why.

One thing that I think feels off is an incomplete understanding of the Coase paper.  What I think the article gets correct: Coase looks at the difference between markets (economists' preferred efficient mechanism) and firms / corporations, and observes that transaction costs (for people these would be contracts, but in general all tr... (read more)

This post was personally meaningful to me, and I'll try to cover that in my review while still analyzing it in the context of lesswrong articles.

I don't have much to add about the 'history of rationality' or the description of interactions of specific people.

Most of my value from this post wasn't directly from the content, but how the content connected to things outside of rationality and lesswrong.  So, basically, I loved the citations.

Lesswrong is very dense in self-links and self-citations, and to a lesser degree does still have a good number of li... (read more)

I read this sequence and then went through the whole thing.  Without this sequence I'd probably still be procrastinating / putting it off.  I think everything else I could write in review is less important than how directly this impacted me.

Still, a review: (of the whole sequence, not just this post)

First off, it signposts well what it is and who it's for.  I really appreciate when posts do that, and this clearly gives the top level focus and what's in/out.

This sequence is "How to do a thing" - a pretty big thing, with a lot of steps and bran... (read more)

Summary

  • public discourse about politics is too focused on the meta level and not focused enough on the object level
  • the downsides are primarily in insufficient exploration of possibility space

Definitions

  • "politics" is topics related to government, especially candidates for elected positions, and policy proposals
  • the opposite of meta is the object level - specific policies, or specific impacts of specific actions, etc
  • "meta" is focused on intangibles that are an abstraction away from some object-level feature, X, e.g. someone's beliefs about X, or incentives around X, or media coverage v
... (read more)

Thoughts, mostly on an alternative set of next experiments:

I find interpolations of effects to be a more intuitive way to study treatment effects, especially if I can modulate the treatment down to zero in a way that smoothly and predictably approaches the null case.  It's not exactly clear to me what the "nothing going on" case is, but here are some possible experiments to interpolate between it and your treatment case.

  • alpha interpolation noise: A * noise + (1 - A) * MNIST, where the null case is the all-noise case (sketched in code below).  Worth probably trying out a bu
... (read more)
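
Here's a minimal sketch of the alpha-interpolation setup I have in mind (the helper name and the torchvision loading are illustrative choices on my part, not anything from the parent post):

    import numpy as np
    from torchvision import datasets

    def interpolate_with_noise(images, alpha, rng=None):
        """Return alpha * noise + (1 - alpha) * images, so alpha=1 is the all-noise null case."""
        if rng is None:
            rng = np.random.default_rng(0)
        noise = rng.uniform(0.0, 1.0, size=images.shape)
        return alpha * noise + (1.0 - alpha) * images

    # Usage: sweep alpha from 0 (pure MNIST) to 1 (pure noise), train the same model
    # at each level, and watch where the effect you're measuring disappears.
    mnist = datasets.MNIST(root="./data", train=True, download=True)
    images = mnist.data.numpy()[:1000].astype(np.float32) / 255.0
    for alpha in np.linspace(0.0, 1.0, 5):
        blended = interpolate_with_noise(images, alpha)
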
A RayΩ111

I think at this point these feel like empirical questions, which I think would be much more clearly answered by demonstrations or experiments.

Trying to encode an additional penalty on changing non-semantic information is an interesting idea.

However I think you're missing that you don't have the ability to directly compare to a reference LM in cases where you're training to improve on some performance benchmark.  During training the model will change its predictions on everything to some degree (both semantic and nonsemantic content).

So your proposed d... (read more)

A RayΩ221

If what you’re saying is “any change to the distribution will change KL” — I think that’s just correct.

This also applies to changes during training where the model is learning to perform better on the objective task.

So we are expecting some amount of KL divergence already.

My claims are (claim 1 is sketched numerically after the list):

  1. The cheapest places to hide information (due to KL) are places where the model already has high entropy (i.e. it is uncertain between many possible outputs)

  2. optimization pressure will try to push this extra information into the cheapest places to hide

  3. the increase in KL won’t be clearly distinguishable from the increase due to improved performance on the task
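
A toy numerical illustration of claim 1 (my own construction, with a made-up 8-token vocabulary): encoding one bit by restricting the reference distribution to half of the vocabulary costs KL equal to -log P(that half), which is near the log(2) floor when the model is uncertain and much more when it is nearly deterministic.

    import numpy as np

    def bit_encoding_cost(reference):
        """Average KL penalty (nats) for hiding one uniform random bit at this position."""
        p = np.asarray(reference, dtype=float)
        even_mass = p[0::2].sum()   # bit = 0 -> sample only "even" tokens, renormalized
        odd_mass = p[1::2].sum()    # bit = 1 -> sample only "odd" tokens, renormalized
        # KL(restricted-and-renormalized || reference) = -log(mass of the kept half)
        return 0.5 * -np.log(even_mass) + 0.5 * -np.log(odd_mass)

    high_entropy = np.full(8, 1 / 8)                # model is uncertain across 8 tokens
    low_entropy = np.array([0.965] + [0.005] * 7)   # model is nearly certain of one token

    print(bit_encoding_cost(high_entropy))  # ~0.69 nats: the log(2) floor for one bit
    print(bit_encoding_cost(low_entropy))   # ~1.97 nats: hiding here is much more costly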

4Quintin Pope
I think it probably works out in the other direction. The easiest way to encode problem-relevant info is by changing the conceptual content of the generated text, not weird changes to the grammatical rules. This is because KL penalties and cross entropy loss actually prioritize modeling exact grammatical patterns over conceptual content. E.g., GPTs will quickly learn to generate ~perfect grammar, but are slow to pick up on / produce coherent conceptual content. Another framing is to think of the KL regularized capabilities training as Bayes updating of the model’s priors on the distribution over text. The model should have much stronger priors on grammatical rules than on higher level concepts. So, the update probably changes conceptual patterns more than grammatical ones. Can’t you just compare the token predictions of the new model against the reference LM? Then you can check if they disagree on weird grammatical patterns or on conceptual tokens that are related to a human-like understanding of how to tackle the problem. Maybe you can even bias the KL penalty to apply higher weight to stopwords, punctuation, and synonym choices?

I think that’s right, but also the moral maze will be mediating the information and decision-making support that’s available to the leadership, so they’re not totally immune from the influences.

A RayΩ220

I think I understand what you're saying, but I want to double check and try laying it out explicitly.

  • The "null space" is actually the null space of the prior model, which is human-like, but not exactly lined up to human perception of language
  • Because of this, the information is "hidden" w.r.t. human perceptions, it's "hidden" w.r.t. the language model prior

I think I agree with both of these points, but here's my thinking for why I still expect to see this phenomenon (and why the article above was simplified to just be "human"):

  • Empirically drawing many samples
... (read more)
1Quintin Pope
Apologies for my first comment. I was tired, and most of what I said about KL divergence is actually wrong. The core issue is that you can't actually avoid the KL penalty by hiding in the nullspace. That's just not how KL divergence works. A KL divergence of zero implies that the two LMs specify exactly the same distribution over tokens. If you deviate away from the reference LM's prior in any manner, then you take a KL penalty.  E.g., if the reference LM is indifferent between the phrases “My stomach is grumbling” or “I could use a snack”, but the steganographic LM always uses “My stomach is grumbling”, then this deviation is also penalized by KL divergence, regardless of whether the reference LM's indifference between the two phrasings means you call this part of the LM's "nullspace".
A RayΩ340

I think this is worth considering even in cases of things like recursive prompting for a couple of reasons that I haven't organized:

  • In general additional compute time lets you improve your answer, so by default I'd expect answers to any significant question to take multiple steps
    • I agree for trivial questions it might not do any of this.  My expectation is that AI deployments impacts will mostly be on things like non-trivial questions or tasks
  • So far current research hasn't done a great deal of regularizing or penalizing models for using a lot of thinki
... (read more)
A RayΩ330

Putting what I think is the most important part of my reply first: I think research into mitigations is premature and instead demonstrating/measuring the phenomena should take priority.

However given that, I think I agree that these are all possible mitigations to the phenomena, in particular (rephrasing your points; a small capacity calculation follows the list):

  • Reducing the channel capacity, and thus the amount of information that can be transmitted
    • Constraining the outputs, and thus reducing the degrees of freedom of the null space
    • Adding noise (via dropout or otherwise) also reduces the channel capaci
... (read more)
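
A toy calculation of the noise point (my own framing, not from the parent comment): if dropout or paraphrasing flips each hidden bit with some probability, the covert channel is at best a binary symmetric channel, whose capacity per transmitted bit is 1 - H(flip).

    import math

    def binary_entropy(p):
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def bsc_capacity(flip_prob):
        """Bits recoverable per hidden bit on a binary symmetric channel."""
        return 1.0 - binary_entropy(flip_prob)

    for flip in (0.0, 0.05, 0.2, 0.5):
        print(f"flip={flip:.2f}  capacity={bsc_capacity(flip):.3f} bits")
    # flip=0.50 drives capacity to 0: enough noise destroys the hidden channel entirely.
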
1Hoagy
Yeah I agree demonstrating it and examining it is very important, especially in chain-of-thought where the existence of hidden information isn't a certainty. They're a bit out of date but the original proposal is ELK Sub - Note-taking in internal rollouts and the multiple model protocol is in Note Taking Without Hidden Messages. Haven't done any proper testing but it's high up on my agenda, would be interested in working out what tests would be best - will elaborate later.
A RayΩ560

Agree that founders are a bit of an exception.  Actually that bit is in the longer version of this that I give when I talk about it in person.

Basically: "The only people at the very top of large tech companies are either founders or those who were able to climb to the tops of moral mazes".

So my strategic corollary to this is that it's probably weakly better for AI alignment for founders to be in charge of companies longer, and to get replaced less often.

In the case of facebook, even in the face of all of their history of actions, I think on the margin I'd pr... (read more)

1Ivan Vendrov
Agreed on all points! One clarification is that large founder-led companies, including Facebook, are all moral mazes internally (i.e. from the perspective of the typical employee); but their founders often have so much legitimacy that their external actions are only weakly influenced by moral maze dynamics. I guess that means that if AGI deployment is very incremental - a sequence of small changes to many different AI systems, that only in retrospect add up to AGI - moral maze dynamics will still be paramount, even in founder-led companies.

Thanks, fixed the link in the article.  Should have pointed here: https://www.lesswrong.com/posts/dhj9dhiwhq3DX6W8z/hero-licensing

A RayΩ7120

I think there should be a norm about adding the big-bench canary string to any document describing AI evaluations in detail, where you wouldn't want it to be inside that AI's training data.

Maybe in the future we'll have a better tag for "don't train on me", but for now the big-bench canary string is the best we have.

This is in addition to things like "maybe don't post it to the public internet" or "maybe don't link to it from public posts" or other ways of ensuring it doesn't end up in training corpora.

I think this is a situation for defense-in-depth.
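
As a sketch of what the corpus-side layer of that defense could look like (illustrative only -- the CANARY constant below is a placeholder; the real string should be copied verbatim from the BIG-bench repository, not retyped from memory):

    from typing import Iterable, Iterator

    # Placeholder: paste the exact canary string from the BIG-bench repository here.
    CANARY = "<BIG-bench canary string goes here>"

    def filter_canaried(documents: Iterable[str]) -> Iterator[str]:
        """Yield only documents that do not contain the canary marker."""
        for doc in documents:
            if CANARY not in doc:
                yield doc

    # Usage: training_corpus = list(filter_canaried(raw_documents))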

2Daniel Kokotajlo
What is the canary exactly? I'd like to have a handy reference to copy-paste that I can point people to. Google fails me.

More Ideas or More Consensus?

I think one aspect you can examine about a scientific field is its "spread"-ness of ideas and resources.

High energy particle physics is an interesting extremum here -- there's broad agreement in the field about building higher energy accelerators, and this means there can be lots of consensus about supporting a shared collaborative high energy accelerator.

I think a feature of mature scientific fields is that "more consensus" can unlock more progress.  Perhaps if there had been more consensus, the otherwise ill-fated supercond... (read more)

A RayΩ8170

AGI will probably be deployed by a Moral Maze

Moral Mazes is my favorite management book ever, because instead of "how to be a good manager" it's about "empirical observations of large-scale organizational dynamics involving management".

I wish someone would write an updated version -- a lot has changed (though a lot has stayed the same) since the research for the book was done in the early 1980s.

My take (and the author's take) is that any company of nontrivial size begins to take on the characteristics of a moral maze.  It seems to be a pretty good nul... (read more)

1Ivan Vendrov
Agreed, but Silicon Valley wisdom says founder-led and -controlled companies are exceptionally dynamic, which matters here because the company that deploys AGI is reasonably likely to be one of those. For such companies, the personality and ideological commitments of the founder(s) are likely more predictive of external behavior than properties of moral mazes. Facebook's pivot to the "metaverse", for instance, likely could not have been executed by a moral maze. If we believed that Facebook / Meta was overwhelmingly likely to deploy one of the first AGIs, I expect Mark Zuckerberg's beliefs about AGI safety would be more important to understand than the general dynamics of moral mazes. (Facebook example deliberately chosen to avoid taking stances on the more likely AGI players, but I think it's relatively clear which ones are moral mazes).

(Caveat: I ran the first big code scrape and worked on the code generating models which later became codex.)

My one line response: I think opt-out is obviously useful and good and should happen.

AFAIK there are various orgs/bodies working on this, but I'm kinda blanking on what/where.  (In particular there's a FOSS mailing list that's been discussing how ML training relates to FOSS license rights that seems relevant)

Opt-out strings exist today, in an insufficient form.  The most well known and well respected one is probably the big-bench canary string: htt... (read more)

Sometimes I get asked by intelligent people I trust in other fields, "what's up with AI x risk?" -- and I think at least part of it unpacks to this: Why don't more people believe in / take seriously AI x-risk?

I think that is actually a pretty reasonable question.  I think two follow-ups are worthwhile and I don't know of good citations / don't know if they exist:

  1. a sociological/anthropological/psychological/etc study of what's going on in people who are familiar with the ideas/reasonings of AI x-risk, but decide not to take it seriously / don't believe
... (read more)
A RayΩ7120

Thanks so much for making this!

I'm hopeful this sort of dataset will grow over time as new sources come about.

In particular, I'd nominate adding MLSN (https://www.alignmentforum.org/posts/R39tGLeETfCZJ4FoE/mlsn-4-many-new-interpretability-papers-virtual-logit) to the list of newsletters in the future.

2Ethan Perez
Yes super excited about datasets like this! It might be helpful to also add https://ai-alignment.com/ or https://paulfchristiano.medium.com/ if these aren't already in the data
1jacquesthibs
Good idea! I added most of the papers from the previous entries of MLSN. Adding the summaries would be a useful next step. Would be great if someone could keep track of it in a Google Sheet of individual summaries like the Alignment Newsletter (https://docs.google.com/spreadsheets/d/1lJ6431R-E6aioVRd7AN4LQYTj-QhQlUYNRbGDbG5RWY/edit?usp=sharing). I was also considering adding distillations as a key as well. For example, adding ELK distillations to the ELK report entry.
A RayΩ110

This seems like an overly alarmist take on what is a pretty old trend of research.  Six years ago there were a number of universities working on similar models for the VizDoom competition (IIRC they were won by Intel and Facebook).  It seems good to track this kind of research, but IMO the conclusions here are not supported at all by the evidence presented.

3RomanS
The trend of research is indeed old. But in this case, my alarmism is based on the combination of the following factors:

  • the environment is much closer to a real battlefield than Doom etc
  • the AI is literally optimized for killing humans (or more precisely, entities that look and behave very much like humans)
  • judging by the paper, the AI was surprisingly cheap to create (a couple of GPUs + a few days of publicly available streams). It was also cheap to run in real time (1 GPU)
  • the research was done at a university that is doing AI research for the military
  • it is China, a totalitarian dictatorship that is currently perpetrating a genocide. And it is known for using AI as one of the tools (e.g. for mass surveillance of Uyghurs)
A RayΩ110

Do you have suggestions for domains where you do expect one-turn debate to work well, now that you've got these results?

1Sam Bowman
I have no reason to be especially optimistic given these results, but I suppose there may be some fairly simple questions for which it's possible to enumerate a complete argument in a way that flaws will be clearly apparent. In general, it seems like single-turn debate would have to rely on an extremely careful judge, which we don't quite have, given the time constraint. Multi-turn seems likely to be more forgiving, especially if the judge has any influence over the course of the debate.

Congratulations!  Can you say if there will be a board, and if so who will start on it?

2Connor Leahy
Currently, there is only one board position, which I hold. I also have triple vote as insurance if we decide to expand the board. We don’t plan to give up board control.

Longtermist X-Risk Cases for working in Semiconductor Manufacturing

Two separate pitches for jobs/roles in semiconductor manufacturing for people who are primarily interested in x-risk reduction.

Securing Semiconductor Supply Chains

This is basically the "computer security for x-risk reduction" argument applied to semiconductor manufacturing.

Briefly restating: it seems exceedingly likely that technologies crucial to x-risks are on computers or connected to computers.  Improving computer security increases the likelihood that those machines are not stolen... (read more)

A RayΩ330

I think your explanation of legibility here is basically what I have in mind, excepting that if it's human designed it's potentially not all encompassing.  (For example, a world model that knows very little, but knows how to search for information in a library)

I think interpretability is usually a bit more narrow, and refers to developing an understanding of an illegible system.  My take is that it is not "interpretability" to understand a legible system, but maybe I'm using the term differently than others here.  This is why I don't think "... (read more)

3Steven Byrnes
In what sense? Your world-model is built out of ~100 trillion synapses, storing all sorts of illegible information including “the way my friend sounds when he talks with his mouth full” and “how it feels to ride a bicycle whose gears need lubrication”. That seems very different though! The GPT-3 source code is rather compact (gradient descent etc.); combine it with data and you get a huge and extraordinarily complicated illegible world-model (or just plain “model” in the GPT-3 case, if you prefer). Likewise, the human brain has a learning algorithm that builds a world-model. The learning algorithm is (I happen to think) a compact easily-human-legible algorithm involving pattern recognition and gradient descent and so on. But the world-model built by that learning algorithm is super huge and complicated. Sorry if I’m misunderstanding. I’ll try to walk through why I think “coming up with new concepts outside what humans have thought of” is required. We want an AGI to be able to do powerful things like independent alignment research and inventing technology. (Otherwise, it’s not really an AGI, or at least doesn’t help us solve the problem that people will make more dangerous AGIs in the future, I claim.) Both these things require finding new patterns that have not been previously noticed by humans. For example, think of the OP that you just wrote. You had some idea in your head—a certain visualization and associated bundle of thoughts and intuitions and analogies—and had to work hard to try to communicate that idea to other humans like me. Again, sorry if I’m misunderstanding.
A RayΩ690

Two Graphs for why Agent Foundations is Important (according to me)

Epistemic Signpost: These are high-level abstract reasons, and I don’t go into precise detail or gears-level models.  The lack of rigor is why I’m short form-ing this.

First Graph: Agent Foundations as Aligned P2B Fixpoint

P2B (a recursive acronym for Plan to P2B Better) is a framing of agency as a recursively self-reinforcing process.  It resembles an abstracted version of recursive self improvement, which also incorporates recursive empowering and recursive resource gathering.  ... (read more)

3Steven Byrnes
RE legibility: In my mind, I don’t normally think there’s a strong connection between agent foundations and legibility. If the AGI has a common-sense understanding of the world (which presumably it does), then it has a world-model, full of terabytes of information of the sort “tires are usually black” etc. It seems to me that either the world-model will be either built by humans (e.g. Cyc), or (much more likely) learned automatically by an algorithm, and if it’s the latter, it will be unlabeled by default, and it’s on us to label it somehow, and there’s no guarantee that every part of it will be easily translatable to human-legible concepts (e.g. the concept of “superstring” would be hard to communicate to a person in the 19th century). But everything in that paragraph above is “interpretability”, not “agent foundations”, at least in my mind. By contrast, when I think of “agent foundations”, I think of things like embedded agency and logical induction and so on. None of these seem to be related to the problem of world-models being huge and hard-to-interpret. Again, world-models must be huge and complicated, because the world is huge and complicated. World-models must have hard-to-translate concepts, because we want AGI to come up with new ideas that have never occurred to humans. Therefore world-model interpretability / legibility is going to be a big hard problem. I don’t see how “better understanding the fundamental nature of agency” will change anything about that situation. Or maybe you’re thinking “at least let’s try to make something more legible than a giant black box containing a mesa-optimizer”, in which case I agree that that’s totally feasible, see my discussion here.
A RayΩ330

Maybe useful: an analogy this post brought to mind for me: Replacing “AI” with “Animals”.

Hypothetical alien civilization, observing Early Earth and commenting on whether it poses a risk.

Doesn't optimization in nature produce non-agentic animals?  It mostly does, but those aren't the ones we're concerned with.  The risk is all concentrated in the agentic animals.

Basically every animal ever is not agentic.  I’ve studied animals for my entire career and I haven’t found an agentic animal yet.  That doesn’t preclude them showing up in the futur... (read more)

A RayΩ230

Hacking the Transformer Prior

Neural Network Priors

I spend a bunch of time thinking about the alignment of the neural network prior for various architectures of neural networks that we expect to see in the future.

Whatever alignment failures are highly likely under the neural network prior are probably worth a lot of research attention.

Separately, it would be good to figure out knobs/levers for changing the prior distribution to be more aligned (or produce more aligned models).  This includes producing more interpretable models.

Analogy to Software Devel... (read more)

4Vaniver
I'm pretty sure you mean functions that perform tasks, like you would put in /utils, but I note that on LW "utility function" often refers to the decision theory concept, and "what decision theoretical utility functions are present in the neural network prior" also seems like an interesting (tho less useful) question.
A RayΩ580

I think there’s a lot going on with your equating the speed prior over circuits with a speed prior over programs.


 

I think a lot of the ideas in this direction are either confused by the difference between circuit priors and program priors, or at least treating them as equivalent.  Unfortunately a lot of this is vague until you start specifying the domain of model.  I think specifying this more clearly will help communicating about these ideas.  To start with this myself, when I talk about circuit induction, I’m talking about things th... (read more)

A RayΩ370

Interpretability Challenges

Inspired by a friend I've been thinking about how to launch/run interpretability competitions, and what the costs/benefits would be.

I like this idea a lot because it cuts directly at one of the hard problems of spinning up in interpretability research as a new person.  The field is difficult and the objectives are vaguely defined; it's easy to accidentally trick yourself into seeing signal in noise, and there's never certainty that the thing you're looking for is actually there.

On the other hand, most of the interpretability... (read more)

My Cyberwarfare Concerns: A disorganized and incomplete list

  • A lot of internet infrastructure (e.g. BGP / routing) basically works because all the big players mostly cooperate.  There have been minor incidents and attacks but nothing major so far.  It seems likely to be the case that if a major superpower was backed into a corner, it could massively disrupt the internet, which would be bad.
  • Cyberwar has a lot of weird asymmetries where the largest attack surfaces are private companies (not militaries/governments).  This gets weirder when priva
... (read more)
1[comment deleted]
2Donald Hobson
Sure, I'm not optimistic about the alignment of cyberweapons, but optimism about them not being too general seems more warranted. They would be another case of people wanting results NOW, ie hacking together existing techniques.
2ChristianKl
Apart from groups whose purpose is attacking, the security teams at the FANG companies are likely also capable of attacking if they wanted and employ some of the most capable individuals. We need a debate about what's okay for a Google security person to do in their 20% time. Is it okay to join the conflict and defend Ukrainian cyber assets? Is it okay to hack Russian targets in the process? Should the FANG companies explicitly order their employees to keep out of the conflict?

I think that the authors at least did some amount of work to distinguish the eras, but agree more work could be done.

Also I agree w/ Stella here that Turing, GPT-J, GShard, and Switch are probably better fit into the “large scale“ era.

A RayΩ220

I wish more of the language alignment research folks were looking into how current proposals for aligning transformers end up working on S4 models.

(I am one of said folks so maybe hypocritical to not work on it)

In particular it seems like there are ways in which it would be more interpretable than transformers:

  • adjustable timescale stepping (either sub-stepping, or super-stepping time)
  • approximately separable state spaces/dynamics -- this one is crazy conjecture -- it seems like it should be possible to force the state space and dynamics into separate groups, i
... (read more)

I work on this sort of thing at OpenAI.

I think alignment datasets are a very useful part of a portfolio approach to alignment research.  Right now I think there are alignment risks/concerns for which datasets like this wouldn't help, but also there are some that it would help for.

Datasets and benchmarks more broadly are useful for forecasting progress, but this assumes smooth/continuous progress (in general a good assumption -- but also good to be wary of cases where this isn't the case).

Some thoughts from working on generating datasets for research, ... (read more)

I worry a little bit about this == techniques which let you hide circuits in neural networks.  These "hiding techniques" are a riposte to techniques based on modularity or clusterability -- techniques that explore naturally emergent patterns.[1]  In a world where we use alignment techniques that rely on internal circuitry being naturally modular, trojan horse networks can avoid various alignment techniques.

I expect this to happen by default for a bunch of reasons.  An easy one to point to is the "free software" + "crypto anarchist" + "fuck y... (read more)

Just copy-pasting the section

We believe that Transformative Artificial Intelligence (TAI) [Karnofsky et al., 2016] is approaching [Cotra, 2020, Grace et al., 2018], and that these systems will cause catastrophic damage if they are misaligned with human values [Fox and Shulman, 2013, Omohundro, 2008]. As such, we believe it is essential to prioritize and help facilitate technical research that ensures TAI’s values will be aligned with ours. 

AI Alignment generally refers to the problem of how to ensure increasingly powerful and autonomous AI systems per

... (read more)
5leogao
Note: there are also several further subsections that dive into much further detail into these points; the quoted section here is the intro to those.
A RayΩ7210

It's probably worth going through the current deep learning theories that propose parts of gears-level models, and seeing how they fit with this.  The first one that comes to mind is the Lottery Ticket Hypothesis.  It seems intuitive to me that certain tasks correspond to some "tickets" that are harder to find.

I like the taxonomy in the Viering and Loog paper, and it links to a bunch of other interesting approaches.

This paper shows phase transitions in data quality as opposed to data size, which is an angle I hadn't considered before.

There's the google pa... (read more)

A RayΩ110

Decomposing Negotiating Value Alignment between multiple agents

Let's say we want two agents to come to agreement on living with each other.  This seems pretty complex to specify; they agree to take each other's values into account (somewhat), not destroy each other (with some level of confidence), etc.

Neither initially has total dominance over the other.  (This implies that neither is corrigible to the other)

A good first step for these agents is to share each's values with the other.  While this could be intractably complex -- it's probably ... (read more)

2Dagon
I think there are LOTS of examples of organisms who cooperate or cohabitate without any level of ontology or conscious valuation.  Even in humans, a whole lot of the negotiation is not legible.  The spoken/written part is mostly signaling and lies, with a small amount of codifying behavioral expectations at a very coarse grain.
1TLW
This is a strong assertion that I do not believe is justified. If you are an agent with this view, then I can take advantage by sending you an altered version of my values such that the altered version's Nash equilibrium (or plural) are all in my favor compared to the Nash equilibria of the original game. (You can mitigate this to an extent by requiring that both parties precommit to their values... in which case I predict what your values will be and use this instead, committing to a version of my values altered according to said prediction. Not perfect, but still arguably better.) (Of course, this has other issues if the other agent is also doing this...)
A RayΩ470

I'm really excited about this research direction. It seems so well-fit to what you've been researching in the past -- so much so that it doesn't seem to be a new research direction so much as a clarification of the direction you were already pursuing.

I think producing a mostly-coherent and somewhat-nuanced generalized theory of alignment would be incredibly valuable to me (and I would consider myself someone working on prosaic alignment strategies).

A common thread in the last year of my work on alignment is something like "How can I be an aligned intellig... (read more)

3Alex Flint
Yeah I agree! It seems that AI alignment is not really something that any existing disciplines is well set up to study. The existing disciplines that study human values are generally very far away from engineering, and the existing disciplines that have an engineering mindset tend to be very far away from directly studying human values. If we merely created a new "subject area" that studies human values + engineering under the standard paradigm of academic STEM, or social science, or philosophy, I don't think it would go well. It seems like a new discipline/paradigm is innovation at a deeper level of reality. (I understand adamShimi's work to be figuring out what this new discipline/paradigm really is.) Interesting! I hadn't thought of habit formation as relating to acausal decision theory. I see the analogy to making trades across time/contexts with yourself but I have the sense that you're referring to something quite different to ordinary trades across time that we would make e.g. with other people. Is the thing you're seeing something like when we're executing a habit we kind of have no space/time left over to be trading with other parts of ourselves, so we just "do the thing such that, if the other parts of ourselves knew we would do that and responded in kind, would lead to overall harmony" ? We could definitely study proxies in detail. We could look at all the market/government/company failures that we can get data on and try to pinpoint what exactly folks were trying to align the intelligent system with, what operationalization was used, and how exactly that failed. I think this could be useful beyond merely cataloging failures as a cautionary tale -- I think it could really give us insight into the nature of intelligent systems. We may also find some modest successes! Hope you are well Alex!

Oh great, thanks.  I think I was just asking folks if they thought it should be discussed separately (since it is a different piece) or together with this one (since they're describing the same research).

2Algon
I don't think it is worth spinning up a new article on it, but I'm a karma-junkie so my opinion is a little biased. More seriously, he doesn't say much that the blog post doesn't, or that you can't see by going on the link to examples of AlphaCode's code. But it is probably better presented.

Should this other post be a separate linkpost for this? https://www.furidamu.org/blog/2022/02/02/competitive-programming-with-alphacode/#fnref:2

Feels like it covers the same, but is a personal description by an author, rather than the deepmind presser.

2Algon
Thanks for mentioning that. I didn't know it existed. But I'm not sure what you mean by "a separate linkpost"? Anyway, I'll link to it at the beginning of the post.
A RayΩ110

I think that's right that upgraded verification by itself is insufficient for 'defense wins' worlds.  I guess I'd thought that was apparent but you're right it's definitely worth saying explicitly.

A big wish of mine is that we end up doing more planning/thinking-things-through for how researchers working on AI today could contribute to 'defense wins' progress.

My implicit other take here that wasn't said out loud is that I don't really know of other pathways where good theorem proving translates to better AI x-risk outcomes.  I'd be eager to know of these.

Thoughts:

First, it seems worthwhile to try taboo-ing the word 'deception' and see whether the process of building precision to re-define it clears up some of the confusion.  In particular, it seems like there's some implicit theory-of-mind stuff going on in the post and in some of the comments.  I'm interested if you think the concept of 'deception' in this post only holds when there is implicit theory-of-mind going on, or otherwise.

As a thought experiment for a non-theory-of-mind example, let's say the daemon doesn't really understand why it get... (read more)

2Charlie Steiner
I think there's also some theory of mind used by us outside observers when we label some events "deception." We're finding it easy to predict what the "evil demon" will do by using a simple model that treats it as an agent trying to affect the beliefs of another agent. If the stuff in the environment didn't reward the intentional stance like this, we wouldn't see the things as agents, let alone see them as doing deception.
A RayΩ340

Copying some brief thoughts on what I think about working on automated theorem proving relating to working on aligned AGI:

  • I think a pure-mathematical theorem prover is more likely to be beneficial and less likely to be catastrophic than STEM-AI / PASTA
  • I think it's correspondingly going to be less useful
  • I'm optimistic that it could be used to upgrade formal software verification and cryptographic algorithm verification
  • With this, i think you can tell a story about how development in better formal theorem provers can help make information security a "defense
... (read more)
1[comment deleted]
2Ramana Kumar
In my understanding there's a missing step between upgraded verification (of software, algorithms, designs) and a "defence wins" world: what the specifications for these proofs need to be isn't a purely mathematical thing. The missing step is how to figure out what the specs should say. Better theorem proving isn't going to help much with the hard parts of that.

FWIW I think this is basically right in pointing out that there's a bunch of errors in reasoning when people claim a large deep neural network "knows" something or that it "doesn't know" something.

I think this exhibits another issue, though: by strongly changing the contextual prefix, you've confounded it in a bunch of ways that are worth explicitly pointing out:

  • Longer contexts use more compute to generate the same size answer (back-of-envelope sketch below), since they attend over more tokens of input (and it's reasonable to think that in some cases that more compute -> better a
... (read more)
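
A back-of-envelope version of the first bullet (constants picked to be roughly GPT-3-175B-sized; this counts only the attention score/value mixing, which is the part that grows with prompt length):

    def attention_flops_per_token(n_ctx, n_layers=96, d_model=12288):
        """Rough FLOPs spent attending over the prefix while generating one token (with a KV cache)."""
        # ~2 multiply-adds per key for the QK^T scores, plus ~2 per value for mixing, per layer.
        return 4.0 * n_layers * n_ctx * d_model

    for n_ctx in (50, 500, 2000):
        print(f"prefix={n_ctx:5d} tokens -> ~{attention_flops_per_token(n_ctx):.2e} FLOPs/token")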

Quite a lot of scams involve money that is fake.  This seems like another reasonable conclusion.

Like, every time I simulate myself in this sort of experience, almost all of the prior is dominated by "you're lying".

I have spent an unreasonable (and yet unsuccessful) amount of time trying to sketch out how to present omega-like simulations to my friends.

3Pattern
That seems reasonable - I don't think such predictions are that feasible.

Giving Newcomb's Problem to Infosec Nerds

Newcomb-like problems are pretty common thought experiments here, but I haven't seen written up a bunch of my favorite reactions I've gotten when discussing it in person with people.  Here's a disorganized collection:

  • I don't believe you can simulate me ("seems reasonable, what would convince you?") -- <describes an elaborate series of expensive to simulate experiments>.  This never ended in them picking one box or two, just designing ever more elaborate and hard to simulate scenarios involving things like predicti
... (read more)
1TLW
For reference, my response would generally be a combination of these, but for somewhat different reasons. Namely: parity[1] of the first bitcoin block mined at least 2 minutes[2] after the question was asked decides whether to 2box or 1box[3]. Why? A combination of a few things:

  1. It's checkable after the fact.
  2. Memorizing enough details to check it after the fact is fairly doable.
  3. A fake-Omega cannot really e.g. just selectively choose when to ask the question.
  4. It's relatively immutable.
  5. It pulls in sources of randomness from all over.
  6. It's difficult to spoof without either a) being detectable or b) presenting abilities that rule out most 'mundane' explanations.
    1. Sure, a fake-Omega could, for instance, mine the next block themselves
      1. ...but either a) the fake-Omega has broken SHA, in which case yikes, or b) the fake-Omega has a significant amount of computational resources available.

[1] Yes, something like parity of a different secure hash (or e.g. an HMAC, etc) of the block could be better, as e.g. someone could have built a miner that nondeterministically fails to properly calculate a hash depending on how many ones are in the result, but meh. This is simple and good enough I think.
[2] (Or rather, long enough that any blocks already mined have had a chance to propagate.)
[3] In this case https://blockexplorer.one/bitcoin/mainnet/blockId/720944 , which has a hash of ...a914ff87, hence odd, hence 1box.
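
As a minimal illustration of the rule above (using only the hash suffix quoted in footnote [3]):

    # Odd parity of the block hash -> 1box, per the decision rule described above.
    block_hash_suffix = "a914ff87"  # from footnote [3]
    print("1box" if int(block_hash_suffix, 16) % 2 == 1 else "2box")  # -> 1box
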
2Pattern
The scam might make more sense if the money is fake.

Adding a comment instead of another top-level post saying basically the same thing.  Adding my thoughts on things I liked about this plan:

It's centered on people.  A lot of rationality is thinking and deciding and weighing and valuing possible actions.  Another frame that is occasionally good (for me at least) is "How would <my hero> act?" -- and this can help guide my actions.  It's nice to have a human or historical action to think about instead of just a vague virtue or principle.

It encourages looking through history for events o... (read more)

Ideas myself and others have had, off the top of my head:

  • Atul Gawande (checklist manifesto, rearranged how hospitals organize themselves to reduce medical error)
  • Moon landing
  • Invention/inventors of oral rehydration therapy
  • Gutenberg
  • Galileo
  • Newton
  • Darwin
  • Long list of people who sheltered victims from the Holocaust (plus that one German who bought 250k Nanking residents time to escape)
  • Norman Borlaug
  • A general abundance day celebrating all modernity has brought us (Ray suggests summer solstice for this, which seems right)
  • Polio vaccine guy
  • Actually, there are a lot of
... (read more)