Towards_Keeperhood

I'm trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.


Comments


I have a binary distinction that is a bit different from the distinction you're drawing here. (To be clear, one might still draw another distinction like you do, but this might be relevant for your thinking.) I'll make a quick attempt to explain it here, but I'm not sure whether my notes will be sufficient. (Feel free to ask for further clarification, ideally with partial paraphrases and examples of where you're unsure.)

I distinguish between objects and classes:

  • Objects are concrete individual things. E.g. "your monitor", "the meeting you had yesterday", "the German government".
  • A class is a predicate over objects. E.g. "monitor", "meeting", "government".

The relationship between classes and objects is basically like in programming. (In language we can instantiate objects from classes through indicators like "the", "a", "every", "zero", "one", ..., the plural "-s" inflection, the prepended possessive "'s", and perhaps a few more. Though they often only instantiate objects when the phrase is in the subject position; in the object position some of those keywords have a somewhat different function. I'm still exploring the details.)

In language semantics the sentence "Sally is a doctor." is often translated to the logic representation "doctor(Sally)", where "doctor" is a predicate and "Sally" is an object / a variable in our logic. From the perspective of a computer it might look more like adding a statement "P_1432(x_5343)" to our pool of statements believed to be true.

We can likewise say "The person is a doctor" in which case "The person" indicates some object that needs to be inferred from the context, and then we again apply the doctor predicate to the object.

The important thing here is that "doctor" and "Sally"/"the person" have different types. In formal natural language semantics, "doctor" has type <e,t> and "Sally" has type e. (For people interested in learning about semantics, I'd recommend this excellent book draft.[1])
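To make the type distinction concrete, here's a minimal sketch in TypeScript (my own illustration, not from the semantics literature; Entity, Pred, and knownDoctors are made-up names): objects are values of the entity type e, classes are functions from entities to truth values (type <e,t>), and "Sally is a doctor" is just function application.

// Type e: concrete individual things (objects).
type Entity = { name: string };
// Type <e,t>: a class as a predicate over objects.
type Pred = (x: Entity) => boolean;

const sally: Entity = { name: "Sally" };               // an object
const knownDoctors = new Set(["Sally"]);               // stand-in for world knowledge
const doctor: Pred = (x) => knownDoctors.has(x.name);  // the class "doctor"

// "Sally is a doctor." ~> doctor(Sally): apply the predicate to the object.
console.log(doctor(sally));  // true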

There might still be some edge cases to my ontology here, and if you have doubts and find some I'd be interested in exploring those.

Whether there's another crisp distinction between abstract classes (like "market") and classes that are less far upstream from sensory perceptions (like "tree") is a separate question. I don't know whether there is, though my intuition would be leaning towards no.

  1. ^

    I've only read chapters 5-8 so far. I'll read the later ones soon. I think for people familiar with CS the first 4 chapters can be safely skipped.

The meta problem of consciousness is about explaining why people think they are conscious.

Even if we get such a result with AIs, where AIs invent a concept like consciousness from scratch, that would only tell us that they also think they have something we call consciousness, but not yet why they think this.

That is, unless we can somehow precisely inspect the cognitive thought processes that generated the consciousness concept in the AIs, which, on anything like the current paradigm, we won't be able to do.

Another way to frame it: Why would it matter that an AI invents the concept of consciousness, rather than another human? Where is the difference that lets us learn more about the hard/meta problem of consciousness in the first place?


Separately, even if we could analyze the thought processes of AIs in such a case, and thereby solve the meta problem of consciousness by seeing explanations of why AIs/people talk about consciousness the way they do, that doesn't mean you have already solved the meta problem of consciousness now.

In other words, just because you know it's solvable doesn't mean you're done; you haven't solved it yet. It's like the difference between knowing that general relativity exists and actually understanding the theory and the math.

Applications (here) start with a simple 300 word expression of interest and are open until April 15, 2025. We have plans to fund $40M in grants and have available funding for substantially more depending on application quality. 

Did you consider instead committing to give out retroactive funding for research progress that seems useful?

That is, people could apply for funding for anything done from 2025 onward, and then you can actually better evaluate how useful some research was, rather than needing to guess in advance how useful a project might be. And in a way where quite impactful results can be paid a lot, so you don't disincentivize low-chance-high-reward strategies. And so we get impact-market dynamics where investors can fund projects in exchange for a share of the retroactive funding in case of success.

There are difficulties, of course. Intuitively this retroactive approach seems a bit more appealing to me, but I'm basically just asking whether you considered it, and if so, why you didn't go with it.

Applications (here) start with a simple 300 word expression of interest and are open until April 15, 2025. We have plans to fund $40M in grants and have available funding for substantially more depending on application quality. 

Side question: How much is Open Phil funding LTFF? (And why not more?)

(I recently got an email from LTFF which suggested that they are quite funding constrained. And I'd intuitively expect LTFF to be higher impact per dollar than this program, though I don't really know.)

I created an Obsidian Templater template for the 5-minute version of this skill. It inserts the following list:

  • how could I have thought that faster?
    • recall - what are key takeaways/insights?
      •  
    • trace - what substeps did I do?
      •  
    • review - how could one have done it (much) faster?
      •  
      • what parts were good?
      • where did i have wasted motions? what mistakes did i make?
    • generalize lesson - how act in future?
      •  
      • what are example cases where this might be relevant?

Here's the full template so it inserts this at the right level of indentation. (You can set a shortcut for inserting this template. I use "Alt+h".)

<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length) + "- how could I have thought that faster?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 1) + "- recall - what are key takeaways/insights?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- " %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 1) + "- trace - what substeps did I do?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- " %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 1) + "- review - how could one have done it (much) faster?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- " %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- what parts were good?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- where did i have wasted motions? what mistakes did i make?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 1) + "- generalize lesson - how act in future?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- " %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- what are example cases where this might be relevant?" %>

I now want to always think of concrete examples where a lesson might become relevant in the next week/month, instead of just rereading the lessons.

As of a couple of days ago, I have a file where I save lessons from such review exercises so that I can review them periodically.

Some are in the weekly review category and some in the monthly one. Every day when I do my daily recall, I now also check through the lessons under the corresponding weekday and day-of-month tag.

Here's what my file currently looks like:
(I use some short codes for typing faster, like "W=what", "h=how", "t=to", "w=with", and maybe some more.)

- Mon
    - [[lesson - clarify Gs on concrete examples]]
    - [[lesson - delegate whenever you can (including if possible large scale responsibilities where you need to find someone competent and get funding)]]
        - [[lesson - notice when i search for facts (e.g. w GPT) (as opposed to searching for understanding) and then perhaps delegate if possible]]
- Tue
    - [[lesson - do not waste time on designing details that i might want to change later]]
    - [[periodic reminder - stop and review what you'd do if you had pretty unlimited funding -> if it could speed you up, then perhaps try to find some]]
- Wed
    - [[lesson - try to find edge cases where your current model does not work well]]
    - notice when sth worked well (you made good progress) -> see h you did that (-> generalize W t do right next time)
- Thu
    - it's probably useless/counterproductive to apply effort for thinking. rather try to calmly focus your attention.
        - perhaps train to energize the thing you want to think about like a swing through resonance. (?)
- Fri
    - [[lesson - first ask W you want t use a proposal for rather than directly h you want proposal t look like]]
- Sat
    - [[lesson - start w simple plan and try and rv and replan, rather than overoptimize t get great plan directly]]
- Sun
    - group
        - plan for particular (S)G h t achieve it rather than find good general methodology for a large class of Gs
        - [[lesson - when possible t get concrete example (or observations) then get them first before forming models or plans on vague ideas of h it might look like]]
- 1
    - don't dive too deep into math if you don't want to get really good understanding (-> either get shallow or very deep model, not half-deep)
- 2
    - [[lesson - take care not to get sidetracked by math]]
- 3
    - [[lesson - when writing an important message or making a presentation, imagine what the other person will likely think]]
- 4
    - [[lesson - read (problem statements) precisely]]
- 5
    - perhaps more often ask myself "Y do i blv W i blv?" (e.g. after rc W i think are good insights/plans)
- 6
    - sometimes imagine W keepers would want you to do
- 7
    - group
        - beware conceptual limitations you set yourself
        - sometimes imagine you were smarter
- 8
    - possible tht patts t add
        - if PG not clear -> CPG
        - if G not clear -> CG
        - if not sure h continue -> P
        - if say sth abstract -> TBW
        - if say sth general -> E (example)
- 9
    - ,rc methodology i want t use (and Y)
        - Keltham methodology.
        - loop: pr -> gather obs -> carve into subprs -> attack a subpr
- 10
    - reminder of insights:
        - hyp that any model i have needs t be able t be applied on examples (?)
        - disentangle habitual execution from model building (??)
        - don't think too abstractly. see underlying structure to be able t carve reality better. don't be blinded by words. TBW.
            - don't ask e.g. W concepts are, but just look at observations and carve useful concepts anew.
        - form models of concrete cases and generalize later.
- 11
    - always do introspection/rationality-training and review practices. (except maybe in some sprints.)
- 12
    - Wr down questions towards the end of a session. Wr down questions after having formed some takeaway. (from Abram)
- 13
    - write out insights more in math (from Abram)
- 14
    - periodically write out my big picture of my research (from Abram)
- 15
    - Hoops. first clarify observations. note confusions. understand the problem.
- 16
    - have multiple hypotheses. including for plans as hypotheses of what's the best course of action.
- 17
    - actually fucking backchain. W are your LT Gs.
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
- 26
- 27
- 28
- 29
- 30
    - read https://www.lesswrong.com/posts/f2NX4mNbB4esdinRs/towards_keeperhood-s-shortform?commentId=D66XSCkv6Sxwwyeep

Belief propagation seems too much of a core of AI capability to me. I'd rather place my hope on GPT7 not being all that good yet at accelerating AI research and us having significantly more time.

This just seems doomed to me. The training runs will be even more expensive, the difficulty of doing anything significant as an outsider ever-higher. If the eventual plan is to get big labs to listen to your research, then isn't it better to start early? (If you have anything significant to say, of course.)

I'd imagine it's not too hard to get a >1 OOM efficiency improvement which one can demonstrate in smaller AIs, and one might use this to get a lab to listen. If the labs are sufficiently uninterested in alignment, it's pretty doomy anyway, even if they adopted a better paradigm.

Also, government interventions might still happen (perhaps more likely because of AI-caused unemployment than because of x-risk, and they won't buy all that much time, but still).

Also, the strategy of "maybe if AIs are more rational they will solve alignment or at least realize that they cannot" seems very unlikely to me to work within the current DL paradigm, though it's still slightly helpful.

(Also maybe some supergenius or my future self or some other group can figure something out.)

I don’t think that. See the bottom part of the comment you’re replying to. (The part after “Here’s what I would say instead:”)

Sorry, my comment was sloppy.

Right, my point is, I don’t see any difference between “AIs that produce slop” and “weak AIs” (a.k.a. “dumb AIs”).

(I agree that the way I used "sloppy" in my comment mostly meant "weak". But some other thoughts:)

So I think there are some dimensions of intelligence which are more important for solving alignment than for creating ASI. If you've read planecrash: WIS and rationality training seem to me more important in that way than INT.
I don't really have much hope for DL-like systems solving alignment, but a related case might be an early transformative AI recognizing this and saying "no, I cannot solve the alignment problem; the way my intelligence is shaped is not well suited to avoiding value drift; we should stop scaling and take more time, during which I work with very smart people like Eliezer etc. for some years to solve alignment". Depending on the intelligence profile of the AI, it might be more or less likely that this happens (currently it seems quite unlikely).
But overall those "better" intelligence dimensions still seem to me too central for AI capabilities, so I wouldn't publish stuff.

(Btw, the way I read John's post was more like "fake alignment proposals are a main failure mode" rather than also "... and therefore we should work on making AIs more rational/sane or whatever". So given that reading, I would maybe defend John's framing, but I'm not sure.)

So the lab implements the non-solution, turns up the self-improvement dial, and by the time anybody realizes they haven’t actually solved the superintelligence alignment problem (if anybody even realizes at all), it’s already too late.

If the AI is producing slop, then why is there a self-improvement dial? Why wouldn’t its self-improvement ideas be things that sound good but don’t actually work, just as its safety ideas are?

Because you can speed up AI capabilities much more easily while being sloppy than you can produce actually good alignment ideas.

If you really think you need to be similarly unsloppy to build ASI as to align ASI, I'd be interested in discussing that. So maybe give some pointers to why you think that (or tell me to start).

(Tbc, I directionally agree with you that anti-slop is very useful for AI capabilities, and I wouldn't publish stuff like Abram's "belief propagation" example.)
