All of algon33's Comments + Replies

I am kind of surprised you didn't reference causal inference here to just gesture at the task in which we "figure out which variables are directly relevant - i.e. which variables mediate the influence of everything else". Are you pointing to a different sort of idea/do you not feel causal inference is adequate for describing this task?

Also, scenario 1 and 2 seem fairly close to the "linear" and "non-linear" models of innovation Jason Crawford described in his talk "The Non-Linear Model of Innovation." To be honest, I preferred his description of the models. ... (read more)

2johnswentworth
Causal inference (or more precisely learning causal structure) is exactly the sort of thing I have in mind here. There's actually a few places in the post where I should distinguish between variables which control an outcome in an information sense (i.e. sufficient to perfectly predict the outcome) vs in a causal sense (i.e. sufficient to cause the outcome under interventions). The main reason I didn't talk about it directly is because I would have had to explain that distinction, and decided that would be too much of a distraction from the main point. I think the takeaway of Jason's talk, as it relates to this post, is that a large chunk of the "science" of achieving consistent outcomes happens in inventors' workshops rather than scientists' labs. The problem is still largely similar, regardless of the label applied, but scientists aren't the only ones doing science.
Answer by algon3350

The Strategy of Conflict is condensed instrumental rationality. Much of the content is covered elsewhere, but I don't know of a superior qualitative presentation.

Talking about qualitative presentations, Thinking Physics is a set of hundreds of physics problems, designed to show how important conservation laws and infinitesimals are. The problems are all solvable with some careful thought, and cover a good deal of ground. I wish more books were written in this way.

1Panashe Fundira
FYI this link doesn't go anywhere. Here's a link to the book's Goodreads page.

Here's a visualisation that goes along with Euclid's Elements

A plot of which theorems are used in the proof of each theorem in Euclid's Elements, ordered by book, e.g. the black dots at the bottom say that proofs in Book 13 mostly used theorems from all the books bar 7, 8, 9 and 12.

This was one of many from an article on "The Empirical MetaMathematics of Euclid and Beyond". It is a long essay on the overarching structure of Euclid's Elements and verifies some claims made about Euclid's Elements e.g. the proofs were ordered in nearly the most parsimon... (read more)

Definitely not a subject, but I'd say that the visualisation of Wolfram's theory of everything is excellent. Of course there are problems with his theory of everything, like the fact that he hasn't actually proved his claims that it generates GR field equations or replicates QM, or shown that his theory evades the critical objection Scott Aaronson raised. But as a visualisation:

  1. It is aesthetically pleasing
  2. Compactly contains the basic ideas of his T.o.E. 
  3. Ties the basic concepts together to see how they could generate a theory of physics

So I'd still rec... (read more)

I am glad you put the quotation marks around "morality as taxes" since what my mind jumped to upon verbalising the title was what you described in the last part of your post: something you'd be glad to evade where possible. In retrospect, it's clear that the quotation marks were meant to point to another approach and not the one your thought experiment is meant to represent. Still, I think "Wholehearted choices vs morality as taxes" would be a little clearer as a title.

Answer by algon3310

Go short on Uber?

My personal reasons:

  1. I assumed the question was about the first few decades after "first contact".
  2. A large chunk of my probability mass is on first contact being unintentional, and something neither side can do much about. Or perhaps one "side" is unaware of it. Like if we receive some message directed to no one in particular, or a recording of the remnants of some extreme cosmic event that seems mighty unnatural.
  3. It feels like we're near certain to have created an AGI by then. I am unsure enough about the long term time scales of AGI improvement, and their l
... (read more)

You should alter the questions to make it clear "we" is meant to be humans or whatever we make that succeeds us.

Also, perhaps add a question on whether "first contact" will be us detecting them without their being aware of it.

Minor quibble which I hope isn't breaking a norm: BetFair did seem to pay out last week, or at least some of the bets on who would win the presidency were settled on 07/11/20. 

Do you expect we'll be in the midst of a third wave before the vaccine begins to be doled out? Or just beginning to enter one?

Thanks for the post.

4Zvi
I think we are in the third wave now, and that seems unambiguous. Unless there are delays it will not be over when the vaccine is being rolled out.
3SimonM
As far as I can tell the exchange settled the nominee markets and the state markets, excluding AZ, GA, MI, NV, PA and WI. They haven't settled the big market (Next President) nor any of the key states which are "contested" (for whatever definition of contested you're interested in), nor any of the vote percentage, EC handicap markets etc. It's possible the sportsbook settled the presidential market, but that's a very different beast to the exchange. (Lots of "normal" bookies have already settled the presidential market.)
Answer by algon3350

 Use a fire lighter

  1. Use a matchstick
  2. Use a magnifying glass on the candle wick
  3. Use a very large magnifying glass on the candle itself, igniting the wax
  4. Chuck the candle into a very hot oven
  5. Use a laser to ignite it, perhaps getting one from a CD scanner and overloading it
  6. Run a stupendously large electric current through the candle wax or the wick.
  7. Go into a volcano
  8. Launch the candle into the sun in one of 50 ways
  9. Build a simple bomb, perhaps a flour bomb, and use it to "ignite" the candle momentarily.
  10. Grab some wood sticks, dry them out for a couple months, tu
... (read more)
3Slider
34. Naming the concept made me seriously chuckle. Hope you get reinterested to get to full scope.

The post about Sweden's unusual situation you linked to has been updated. The author claims that the reduced death rate is mostly due to younger people getting nearly all of the covid cases, which is supported by recent data (the figure shows total number of changes between July and 03 Nov). Why that is the case is another issue.

 

Edit: As always, thanks for the post. 

What about Thorium? A back of the envelope calculation suggests thorium reactors could supply us with energy for 100-500 years. I got this from a few sources. First, I used the figure of 170 GW days produced per metric tonne of fuel (Fort St Vrain HTR) and the availability of fuel (500-2500 ktonnes according to Wikipedia) to estimate 10-50 years out of Thorium reactors if we keep using 15TW of energy. And that's not even accounting for breeding reactors, which can produce their own fuel. So if we do go with the theoretical maximum, then we should multiply thi... (read more)
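A minimal sketch of the arithmetic in the comment above, using only the three figures it quotes (170 GW days per tonne, 500-2500 ktonnes of fuel, 15 TW of demand). The function name is made up for illustration, and the sketch assumes every GW day of burnup is delivered as usable energy with no conversion losses and no breeding. The comment does not state what efficiency or rounding it used, so this is a rough check rather than a reconstruction of the original calculation.

```python
def years_of_supply(burnup_gw_days_per_tonne, reserves_tonnes, demand_tw):
    """Years a fuel reserve lasts at constant demand (no breeding, no losses)."""
    # Total energy in the reserve, expressed in GW-days.
    total_gw_days = burnup_gw_days_per_tonne * reserves_tonnes
    # Convert demand from TW to GW so the units match.
    demand_gw = demand_tw * 1000.0
    days = total_gw_days / demand_gw
    return days / 365.25


# Low and high ends of the quoted reserve range (500 and 2500 kilotonnes).
low = years_of_supply(170, 500e3, 15)
high = years_of_supply(170, 2500e3, 15)
print(f"{low:.1f} to {high:.1f} years")  # prints "15.5 to 77.6 years"
```

Under these assumptions the result is a few tens of years, the same order of magnitude as the 10-50 year estimate quoted above.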

algon33*10

Thanks for the reply. Feelings of helplessness sounds about right, and I think you may be right about giving yourself the feeling that you are being supported. Only, people with severe chronic pain often suffer from anxiety and depression as well. It seems like it would be a hard battle getting their brains to recognise those aforementioned feelings.

2Kaj_Sotala
It can definitely be very difficult, yeah.

How does this apply for physically painful trauma? I understand that the broader process should work, but I'm curious if you could guess what frame would be the most helpful for such trauma.

2Kaj_Sotala
Hmm, interestingly I don't feel like any physically painful experiences have given me significant trauma, even though I've had broken bones a few times etc. I think this is because I've generally felt socially supported during those times, and confident that the experiences will eventually pass: my impression is that a feeling of helplessness plays a big role in whether a physically painful experience gets interpreted as traumatic or not. So in principle giving your past self affection and a feeling of being safe and supported could also help with that. At least that would be my guess based on my limited experience.

Somewhat urgent: can anyone recommend a good therapist or psychiatrist for anxiety/depression in the UK? Virtual sessions are probably required. Private is fine. Also, they shouldn't be someone biased towards rationalist types. The person I'm thinking of has nearly no knowledge of these ideas.

Other recommendations that seem relevant welcome.

I still disagree. You can use Fermat's last theorem rigorously without understanding why it works. Same for the four colour theorem. And which mathematicians understand why we can classify finite simple groups the way we do? I'd bet fewer than a percent do. Little wonder, if the proof's 3 volumes long! My point is that there are many theorems a mathematician will use without rigorously knowing why they work. Oh sure, you can tell them a rough story outlining the ideas. But could they prove it themselves? Probably not, without a deep understanding of the area. Y... (read more)

1vlad.proex
To stay on computer science analogies, this reminds me of the principle of abstraction. When you call an API, it sort of feels like magic. A task gets done, and you trust that it was done correctly, and that saves you the time of controlling the code and rewriting it from scratch. "We have only to think out how this is to be done once, and forget then how it is done." (A. Turing, 1947). 
2Richard_Kennaway
You can understand what these theorems say without knowing how they were proved. But non-standard analysis requires a substantial amount of extra knowledge to even understand the transfer principle. In contrast, epsilon-delta requires no such sophistication.

Note that I didn't say it's not an aesthetic preference. I just don't think "likely to be false" --> "ugly", though I agree that learning it's likely to be false --> uglier than before.

No, to understand why the transfer principle works requires a fair amount of knowledge of mathematical logic. It doesn't follow that you can't perform rigorous proofs once you've accepted it. Or am I missing something here?

2Richard_Kennaway
If you don't understand why the transfer principle works, you would just be accepting it as magic. This is not rigorous.

Because of the shift in culture in mathematics, wherein the old proofs were considered unrigorous. Analysis à la Weierstrass put the old statements on firmer footing, everyone migrated there, and infinitesimals were left to languish until a transfer principle was proven to give them a rigorous founding. But by that time, standard analysis had borne such great fruits that it was deeply intertwined with modern mathematics. And of course, there's been a trend against the infinitary and against the incomputable in the past century.

So there's both instituti... (read more)

3ChristianKl
What do you consider "being fine with infinitary mathematics" if it's not an aesthetic preference? (and thus the word ugly would apply)
2Richard_Kennaway
Also, to use infinitesimals rigorously takes a fair amount of knowledge of mathematical logic, otherwise what works and what does not is just magic. Epsilon-delta proofs do not need any magic, nor any more logic than that needed to contend with mathematics at all.

Stuart, by " is complex" are you referring to their using  as the estimated reward function?

Also, what did you think of their argument that their agents have no incentive to manipulate their beliefs because they evaluate future trajectories based on their current beliefs about how likely they are? Does that suffice to implement eq. (1) from your motivated value selection paper?

2Stuart_Armstrong
I mean that defining Prt can be done in many different ways, and hence has a lot of contingent structure. In contrast, in Plp(R∣D1:j,ρ), the ρ is a complex distribution on R, conditional on D1:j; hence Plp itself is trivial and just encodes "apply ρ to R and D1:j in the obvious way".

Not really? The axioms (for hyperreals) aren't much different from those of the reals. Yes, it's true that you need some strange constructions to justify that the algebra works as you'd expect. But many proofs in analysis become intuitive with this setup, and surely aid pedagogy. Admittedly, you need to learn standard analysis anyway since the tools are used in so many more areas. But I'd hardly call it ugly.

2ChristianKl
So why do you think it is that math mostly doesn't get taught in a way where calculus is based on infinitely small numbers?

Recall that memories are pathway dependent, i.e. you can remember an "idea" when given verbal cues but not visual ones. Or given cues in the form of "complete this sentence" but not "complete this totally different sentence expressing the same concept". If you memorise a sentence and can recall it in any relevant context, I'd say you've basically learnt it. But just putting it into SRS on its own won't do that. Like, that's why SuperMemo has such a long list of rules and heuristics on how to use SRS effectively.

1Raj Thimmiah
Agree on this, memory coherence is pretty important. Cramming leads to results sort of like how you can't combine the trig you learned in highschool with some physics knowledge: there aren't good connections between the subjects, leaving them relatively siloed. It requires both effort and actually wanting to learn a thing for the thing to integrate well. We tend to forget easily the things we don't care about (see school knowledge).

Thanks. Two questions:

Do the staff and faculty have a similar diversity of opinions?

Is messaging chai-info@berkeley.edu in order to contact your peers the right procedure here?

5Rohin Shah
Hmm, of the faculty, Stuart spends the most time thinking about AI alignment. I'm not sure how much the other faculty have thought about corrigibility -- they'll have views about the off switch game, but not about MIRI-style corrigibility. Most of the staff doesn't work on technical research, so they probably won't have strong opinions. Exceptions: Critch and Karthika (though I don't think Karthika has engaged much with corrigibility). Probably the best way is to find emails of individual researchers online and email them directly. I've also left a message on our Slack linking to this discussion.

Here's the re-written version, and thanks for the feedback.

Having an Anki deck is kind of useless in my view. When you encounter a new idea, you need to integrate it with the rest of your thoughts. For a technique, you must integrate it with your unconscious. But often, there's a tendency to just go "oh, that's useful" and do nothing with it. Putting it into spaced repetition software to view later won't accomplish anything since you're basically memorising the teacher's password. Now suppose you take the idea, th... (read more)

2Raj Thimmiah
Haha, thanks for the rewrite, makes much more sense now. tradeoff cognitive buck Completely agree: too easy to cram mindlessly with Anki, I think in large part because of how much work it takes to make cards yourself. I'm a bit skeptical of the drilling idea because cards taking more than 5 seconds to complete tend to become leeches and aren't the kind of thing you could do long-term, especially with Anki's algorithm. Still worth trying though, would be interested to hear if you or anyone else you know has gotten much benefit from it. With the thoroughness vs. designer complexity, I think all the options with Anki kind of suck (mainly because I don't think they would work for my level of conscientiousness, at least). If end users make their own cards, they'll give up (or at least most people would, I think. It's not very fun making cards from scratch). If you design something for end users (possibly with some of the commoncog tacit knowledge stuff) I think it's sort of beneficial but you wouldn't get same coherence boost as making stuff yourself. Too easy to learn cards but not actually integrate them, usably. It also seems like a pain to make. For declarative knowledge, I think the best balance for learning is curating content really well for incremental reading alongside (very importantly) either coaching* or more material on meta-skills of knowledge selection to prevent people from FOMO memorizing everything. I think with SuperMemo it wouldn't be hard to make a collection of good material for people to go through in a sane, inferential distance order. Still a fair bit of work for makers but not hellish. I'm very, very, very curious about the tacit knowledge stuff. I still haven't gotten through all of the commoncog articles on tacit knowledge, though I've been going through them for a while, but in terms of instrumental rationality they seem very pragmatic. (I particularly enjoyed his criticism of rationalists in Chinese Businessmen: Superstition Doesn'
Answer by algon3330

Having an Anki deck is kind of useless in my view as engaging with the ideas is not the path of least resistance. There's a tendency to just go "oh, that's useful" and do nothing with it because Anki/Supermemo are about memorisation. Using them for learning, or creating, is possible with the right mental habits. But for an irrational person, that's exactly what you want to instill! No, you need a system which fundamentally encourages those good habits.

Which is why I'm bearish about including cards that tell you to drill certa... (read more)

1MikkW
I think one crux between us is the degree to which "memory is the foundation of cognition", as Michael Nielsen once put it. Coming from the perspective that this is true, it seems to me that a natural consequence of a person memorizing even a simple sentence, and maintaining that memory with SRS, is that the sentence needs to be compressed in the mind to ensure that it has high stability, and can be recalled even after having not been used for many months, or even years. In order to achieve this compression, it is inevitable that the ideas represented by the sentence will become internalized and integrated deeply with other parts of the mind, which is exactly what is desired. This process is a fundamental part of how the human mind works, and applies even in the mind of a person with low "rationality".
2Raj Thimmiah
Could you rewrite some of the first paragraph? I read it 2-3 times and was still kind of confused. Funny you linked commoncog, was about to link that too. Great blog.

Active IRD doesn't have anything to do with corrigibility, I guess my mind just switched off when I was writing that. Anyway, how diverse are CHAI's views on corrigibility? Could you tell me who I should talk to? Because I've already read all the published stuff on it, if I'm understanding you rightly, and I want to make sure that all the perspectives on this topic are covered.

2Rohin Shah
Hmm, I expect each grad student will have a slightly different perspective, but off the top of my head I think Michael Dennis has the most opinions on it. (Other people could include Daniel Filan and Adam Gleave.)

Hey Rohin, I'm writing a review on everything that's been written on corrigibility so far. Do "the off switch game", "Active Inverse Reward Design", "should robots be obedient", "incorrigibility in CIRL" as well as your reply in the Newsletter represent CHAI's current views on the subject? If not, which papers contain them?

2Rohin Shah
Uh, I don't speak for CHAI, and my views differ pretty significantly from e.g. Dylan's or Stuart's on several topics. (And other grad students differ even more.) But those seem like reasonable CHAI papers to look at (though I'm not sure how Active IRD relates to corrigibility). Chapter 3 of the Value Learning sequence has some of my takes on reward uncertainty, which probably includes some thoughts about corrigibility somewhere. Human Compatible also talks about corrigibility iirc, though I think the discussion is pretty similar to the one in the off switch game?
IIRC, this also shows a discontinuous flip at the bottom followed by slower change.

Maybe edit the post so you include this? I know I was wondering about this too.

2nostalgebraist
Post has been now updated with a long-ish addendum about this topic.
2nostalgebraist
Good idea, I'll do that. I know I'd run those plots before, but running them again after writing the post felt like it resolved some of the mystery. If our comparison point is the input, rather than the output, the jump in KL/rank is still there but it's smaller. Moreover, the rarer the input token is, the more it seems to be preserved in later layers (in the sense of low KL / low vocab rank). This may be how tokens like "plasma" are "kept around" for later use.

Before or after what? If it is a passage in a book, or an article you wrote, I agree that's enough. But what about a nebulous concept you struggled to put into words? Or an idea which seemed to have surprising links to other thoughts, which you didn't pursue at the time? If you write all this stuff down explicitly, then fine. If not, and your writing style is like mine, then it seems better to link to other cards and leave it to your future self to figure it out.

Plus, links provide the system extra information with which it can auto-suggest other relevant ideas that you weren't even aware you were considering.

I started writing a blog post in response, but that seems a bit much for a comment. Suffice to say, I agree that anti-spaced repetition is a good idea. However, it throws away the context of the notes you made, as well as showing it to you after your mind has totally forgotten about it. And as I wrote, those seem to be major factors in the value of the Zettelkasten method!

2gwern
Why can't any individual 'item' be shown with context like a dozen lines before/after (eg fading out)?

Yeah, I had some ideas concerning how to keep track of Zettelkasten as well as the right way to display graphs. Reinforcing the network is definitely a worthwhile idea. The entire point is to suggest good links, but also give you the freedom to traverse your graph. RE the hyperlinks: I agree about the worry of biases. But more than that, it seems the network should not automate link suggestion without leaving the option to create links yourself. As you say, the worth of the Zettelkasten method is largely in instilling virtuous mental habits. What you suggested seems like it could instil laziness in the user.

Do you do this in a piecemeal way, or do you assign a few days to re-organising your thoughts when you learn some important new principle?

Epistemic status: unsure

I have a hypothesis about why Zettelkasten provide diminishing returns over time. A corollary is that others should find even less value in your Zettels. Which ties into some of your points, and shows what is missing from the Zibaldone. Plus there are some suggestions on how to correct the flaw.

One of the key benefits to the Zettelkasten is that the way you link cards reflects your psyche's understanding of the ideas. Of course other note-taking systems have this advantage. But this isn't baked into them like it is with Z... (read more)

2romeostevensit
You have to do aggressive recompression of your past stuff. This takes time but pays off in efficiency gains in your internal representations IME.
5gwern
I call this "anti-spaced repetition": the benefit is from surfacing connections for material you've forgotten (as opposed to reviewing material you still remember so as to strengthen retention). You can optimize time spent reviewing older material by using the spacing effect to estimate which things have been forgotten for the longest - same equation, just optimizing for something else.
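A minimal sketch of how "same equation, just optimizing for something else" might look, assuming an exponential forgetting curve with a per-note stability parameter. The `Note` class, its fields, and the example values are hypothetical, and the forgetting model is an assumption chosen for illustration rather than anything specified in the comment. A normal SRS scheduler would surface notes just before recall drops too low; here the ordering is flipped to surface the notes estimated to be most thoroughly forgotten.

```python
import math
from dataclasses import dataclass


@dataclass
class Note:
    title: str
    days_since_seen: float
    stability_days: float  # hypothetical memory-strength parameter


def recall_probability(note):
    """Estimated chance the note is still remembered (exponential decay)."""
    return math.exp(-note.days_since_seen / note.stability_days)


def most_forgotten(notes, k=3):
    """Return the k notes with the lowest estimated recall probability."""
    return sorted(notes, key=recall_probability)[:k]


notes = [
    Note("natural abstractions", days_since_seen=400, stability_days=30),
    Note("transfer principle", days_since_seen=20, stability_days=60),
    Note("off switch game", days_since_seen=200, stability_days=90),
]

for note in most_forgotten(notes, k=2):
    print(note.title, round(recall_probability(note), 3))
```

Sorting ascending by estimated recall is the only change from a retention-focused scheduler, which is why the same forgetting estimate can serve both purposes.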
1Randomini
That'd be an interesting structure for sure. Some kind of spaced repetition, presenting you with two (or three? or more?) prior thoughts to read simultaneously, to reinforce not just the information, but the relationship between the different ideas; not just in isolation, but reinforcing the network itself... maybe with some kind of highlight markup to indicate where the parallel is strongest. With regards to the Zettelkasten containing too many cards to keep up with, I think card agglomeration above a certain threshold might be useful. Rather than hyperlinking between 500 cards with one sentence each on them, it may be preferable to link to particular sentences or sections within 50 different cards. The hyper-deconstruction of the original index card system was I think more a limitation of index card size, and how you might best utilize a physical system. Hypertext and NLP could identify links, and keeping a human in the loop ensures they're the kind of parallels you actually want to be forming... to some degree. Being presented with particular links might still have some subtle undesirable biases. But the prospect of "combine these cards?" might make things more manageable. I think a manual consolidation step would at least prompt for better structure for future consumption. If you know the order you'll always want to view the notes in, turn them into the one note with headings. That could keep the knowledge bank a bit more manageable and navigable. I don't think the current graph view in systems like Roam or Obsidian is enough - you need a text preview of those notes. Maybe for each new note, checking the graph structure remains planar, avoiding messy crossovers in visualisations of it. Or maybe that's a terrible idea, I have no idea! There's lots of low hanging fruit in the space, and whoever makes the biggest fruit basket's going to win here.
Answer by algon3360

When I bother to vote, I do take TK into account when upvoting. Karma serves a signalling purpose. But only when abs(TK) is large. If I see a post with +50 karma, I would have quite high expectations of it. If it exceeds that expectation, and I remember voting is a thing, I will upvote it. Since I almost never downvote, I can't say how much TK affects that.

How does this relate to the whole "no-self" thing? Is the character becoming aware of the player there?

2Kaj_Sotala
Good question. I think that at least some approaches to no-self do break down the mechanisms by which the appearance of a character is maintained, but the extent to which it actually gives insight to the nature of the player (as opposed to giving insight to the non-existence of the character) is unclear to me.
If there is a mistake deep in the belief of someone

Are they not ideal Bayesians? Also, do they update based off other people's priors? It could be interesting to make them all ultra-finitists.

Mimesis Land is confusing from the outside. I'm not sure how they could avoid stumbling upon "correct" forms of manipulating beliefs, if they persist for long enough and there are large enough stochastic shocks to the community's beliefs. If they also copied successful people in the past, I feel like this would be even more likely. Unless they happ... (read more)

4ozziegooen
Happy to hear you enjoyed it! On First Principles Land: Even if they are ideal Bayesians, they could come to a mistake with unfortunate evidence. I'm not sure how we should handle updating on the information of others, that complicates things significantly. I was mostly imagining this as each person independently acts as a semi-ideal Bayesian agent and knows everything from the fundamental truths and evidence themselves. I would be interested in variations with various kinds of knowledge sharing. On Mimesis Land: Yea, this land is confusing to me too. I guess belief manipulation would essentially act as an evolutionary process. Some clusters would learn some techniques for belief selection, and the successful clusters would pass on these belief-selection techniques. That said, this would take a while, and most people could be oblivious to this.
Dutch custom prevents me from recommending my own recent paper in any case

This phrase and its implications are perfect examples of problems in corrigibility. Was that intentional? If so, bravo. Your paper looks interesting, but I think I'll read the blog post first. I want a break from reading heavy papers. I wonder if the researchers would be OK with my drawing on their blog posts in the review. Would you mind?

Thanks for recommending "Reward tampering", it is much appreciated. I'll get on it after synthesising what I've read so far. Otherwise, I don't think I'll learn much.

1Koen.Holtman
Nope, not intentional. You should feel free to write a literature overview that cites or draws heavily on paper-announcement blog posts. I definitely won't mind. In general, the blog posts tend to use language that is less mathematical and more targeted at a non-specialist audience. So if you aim to write a literature overview that is as readable as possible for a general audience, then drawing on phrases from the author's blog posts describing the papers (when such posts are available) may be your best bet.

Hey, thanks for writing all of that. My current goal is to do an up to date literature review on corrigibility, so that was a most helpful comment. I'll definitely look over your blog, since some of these papers are quite dense. Out of the papers you recommended, is there one that stands out? Bear in mind that I've read Stuart's and MIRI's papers already.

2Koen.Holtman
Thanks, you are welcome! Dutch custom prevents me from recommending my own recent paper in any case, so if I had to recommend one paper from the time frame 2015-2020 that you probably have not read yet, I'd recommend 'Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective'. This stands out as an overview of different approaches, and I think you can get a good feeling of the state of the field out of it even if you do not try to decode all the math. Note that there are definitely some worthwhile corrigibility related topics that are discussed only/mainly in blog posts and in LW comment threads, but not in any of the papers I mention above or in my mid-2019 related work section. For example, there is the open question whether Christiano's Iterated Amplification approach will produce a kind of corrigibility as an emergent property of the system, and if so what kind, and is this the kind we want, etc. I have not seen any discussion of this in the 'formal literature', if we define the formal literature as conference/arxiv papers, but there is a lot of discussion of this in blog posts and comment threads.
2Koen.Holtman
Cross-linking to another thread: I just posted a long comment with more references to corrigibility resources in your post asking about corrigibility reading lists. In that comment I focus on corrigibility related work that has appeared as scientific papers and/or arxiv preprints.
algon33Ω010

This post deserves a strong upvote. Since you've done the review, would you mind answering a reference request? What papers/blog posts represent Paul's current views on corrigibility?

2Chi Nguyen
Thanks for the comment and I'm glad you like the post :) On the other topic: I'm sorry, I'm afraid I can't be very helpful here. I'd be somewhat surprised if I'd have had a good answer to this a year ago and certainly don't have one now. Some cop-out answers:
  * I often found reading his (discussions with others in) comments/remarks about corrigibility in posts focused on something else more useful to find out if his thinking changed on this than his blog posts that were obviously concentrating on corrigibility
  * You might have some luck reading through some of his newer blogposts and seeing if you can spot some mentions there
  * In case this was about "his current views" as opposed to "the views I tried to represent here which are one year old": The comments he left are from this summer, so you can get some idea from there/maybe assume that he endorses the parts I wrote that he didn't comment on (at least in the first third of the doc or so when he still left comments)
FWIW, I just had a look through my docs and found a "resources" doc with the following links under corrigibility:
  * Clarifying AI alignment
  * Can corrigibility be learned safely?
  * Problems with amplification/distillation
  * The limits of corrigibility
  * Addressing three problems with counterfactual corrigibility
  * Corrigibility
Not vouching for any of those being the up-to-date or most relevant ones. I'm pretty sure I made this list early on in the process and it probably doesn't represent what I considered the latest Paul-view.

Based off what you've said in the comments, I'm guessing you'd say the various forms of corrigibility are natural abstractions. Would you say we can use the strategy you outline here to get "corrigibility by default"?

Regarding iterations, the common objection is that we're introducing optimisation pressure. So we should expect the usual alignment issues anyway. Under your theory, is this not an issue because of the sparsity of natural abstractions near human values?

2johnswentworth
I'm not sure about whether corrigibility is a natural abstraction. It's at least plausible, and if it is, then corrigibility by default should work under basically-similar assumptions.

Basically, yes. We want the system to use its actual model of human values as a proxy for its objective, which is itself a proxy for human values. So the whole strategy will fall apart in situations where the system converges to the true optimum of its objective. But in situations where a proxy for the system's true optimum would be used (e.g. weak optimization or insufficient data to separate proxy from true), the model of human values may be the best available proxy.

This came out of the discussion you had with John Maxwell, right? Does he think this is a good presentation of his proposal?

How do we know that the unsupervised learner won't have learnt a large number of other embeddings closer to the proxy? If it has, then why should we expect human values to do well?

Some rough thoughts on the data type issue. Depending on what types the unsupervised learner provides to the supervised learner, the latter may not be able to reach the proxy type because of issues with NN learning processes.

Recall that data types can be viewed as hom... (read more)

4John_Maxwell
I'm very glad johnswentworth wrote this, but there are a lot of little details where we seem to disagree--see my other comments in this thread. There are also a few key parts of my proposal not discussed in this post, such as active learning and using an ensemble to fight Goodharting and be more failure-tolerant. I don't think there's going to be a single natural abstraction for "human values" like johnswentworth seems to imply with this post, but I also think that's a solvable problem. (previous discussion for reference)
4johnswentworth
Sort of? That was one significant factor which made me write it up now, and there's definitely a lot of overlap. But this isn't intended as a response/continuation to that discussion, it's a standalone piece, and I don't think I specifically address his thoughts from that conversation. A lot of the material is ideas from the abstraction project which I've been meaning to write up for a while, as well as material from discussions with Rohin that I've been meaning to write up for a while.

Two brief comments here. First, I claim that natural abstraction space is quite discrete (i.e. there usually aren't many concepts very close to each other), though this is nonobvious and I'm not ready to write up a full explanation of the claim yet. Second, for most proxies there probably are natural abstractions closer to the proxy, because most simple proxies are really terrible - for instance, if our proxy is "things people say are ethical on twitter", then there's probably some sort of natural abstraction involving signalling which is closer. Assuming we get the chance to iterate, this is the sort of thing which people hopefully solve by trying stuff and seeing what works. (Not that I give that a super-high chance of success, but it's not out of the question.)

Strongly agree with this, and your explanation is solid. Worth mentioning that we do have some universality results for neural nets, but it's still the case that the neural net structure has implicit priors/biases which could make it hard to learn certain data structures. This is one of several reasons why I see "figuring out what sort-of-thing human values are" as one of the higher-expected-value subproblems on the theoretical side of alignment research.

Alright, here's the link for Friday: meet.google.com/qxw-zpsi-oqn

Thanks for replying.

algon33*10

Hangouts I suppose. It just works. Would next weekend be OK for you?

Edit: I've scheduled a meeting for 12pm UK time on Saturday. Tell me if that works for you.

meet.google.com/kdf-xavk-nnh

2Stuart_Armstrong
Sorry, had a terrible few days, and missed your message. How about Friday, 12pm UK time?
algon33*30

Sometimes the cluster in the map that a preference is pointing at involves another preference, which provides a natural resolution mechanism. I'm unsure what happens when there are two preferences. I suppose it depends on how your map changes. In which case, I think that to work out how to make purity coherent, you should start off with some "simple" map and various "simple" changes in the map. Making purity coherent relative to your map is both computationally hard and empathetically hard.

Side-note: It would be interesting to ... (read more)

2Stuart_Armstrong
No prob. Email or Zoom/Hangouts/Skype?
Answer by algon3330

Second-order logic can also arithmetise sentences, and also has fixed points. So the usual proofs of the first incompleteness theorem carry over. But there's an easier way to see this. There can't be any computable procedure to check whether a second-order sentence is valid, because if there were, we could check whether PA → Theorem is valid, and therefore decide Peano Arithmetic and therefore the Halting problem.
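
To spell out the reduction a little, here is a rough sketch. It assumes full (standard) semantics; under Henkin semantics the argument does not go through. The notation is mine rather than anything standard in this thread: PA² is the conjunction of the finitely many second-order Peano axioms, and Halt_{M,x} is an arithmetic sentence saying that machine M halts on input x.

```latex
\begin{align*}
&\text{Categoricity (Dedekind):} && \mathcal{M} \models \mathrm{PA}^2 \;\Longrightarrow\; \mathcal{M} \cong \mathbb{N} \\
&\text{So for any arithmetic sentence } \varphi: && \mathbb{N} \models \varphi \;\Longleftrightarrow\; {} \models \mathrm{PA}^2 \rightarrow \varphi \\
&\text{Halting is arithmetic:} && M \text{ halts on } x \;\Longleftrightarrow\; \mathbb{N} \models \mathrm{Halt}_{M,x} \\
&\text{Combining:} && M \text{ halts on } x \;\Longleftrightarrow\; {} \models \mathrm{PA}^2 \rightarrow \mathrm{Halt}_{M,x}
\end{align*}
```

So a decision procedure for second-order validity would decide halting, which is impossible.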

Answer by algon3360

You can use them for practicing techniques. Have cards which say: use X technique today. You need to actually do that rather than spend 1 minute thinking about it, which is surprisingly hard. I suspect it works much better if you have some system to guide you in generating new ideas, e.g. Zettelkasten. I suspect it could be even better if the method was incorporated into the software itself. Maybe create links between cards as well, and have some repetitions where you explore the graph surrounding a card?

I'm also unsure if the spaced repetition timings are optimal for drilling techniques. Does anyone know the relevant literature?
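
As a very rough sketch of what "technique cards with links" might look like in software: everything here is hypothetical, including the names `TechniqueCard` and `Deck`. The interval-doubling rule is just a placeholder assumption, not a claim about what real spaced repetition software does or what is optimal.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class TechniqueCard:
    """A card that asks you to practise a technique, not recall a fact."""
    prompt: str
    interval_days: int = 1
    due: date = field(default_factory=date.today)
    links: set[str] = field(default_factory=set)  # names of related cards


class Deck:
    def __init__(self) -> None:
        self.cards: dict[str, TechniqueCard] = {}

    def add(self, name: str, prompt: str) -> None:
        self.cards[name] = TechniqueCard(prompt)

    def link(self, a: str, b: str) -> None:
        # Links are undirected, as in a Zettelkasten-style note graph.
        self.cards[a].links.add(b)
        self.cards[b].links.add(a)

    def due_today(self, today: date) -> list[str]:
        return sorted(n for n, c in self.cards.items() if c.due <= today)

    def neighbourhood(self, name: str) -> list[str]:
        # Cards to explore alongside the one being reviewed.
        return sorted(self.cards[name].links)

    def review(self, name: str, practised: bool, today: date) -> None:
        # Placeholder schedule (an assumption): double the interval if the
        # technique was actually practised, otherwise reset it to one day.
        card = self.cards[name]
        card.interval_days = card.interval_days * 2 if practised else 1
        card.due = today + timedelta(days=card.interval_days)
```

A review session would then call `due_today`, practise each technique for real, look at `neighbourhood` for related cards, and record the result with `review`.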

1ryan wong
Yes that does sound like a promising system for retaining useful techniques. Will try it out.