Towards_Keeperhood

I'm trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.

Due to the generosity of ARIA, we will be able to offer a refund proportional to attendance, with a full refund for completion. The cost of registration is $200, and we plan to refund $25 for each week attended, as well as the final $50 upon completion of the course. We’ll ask participants to pay the registration fee once the cohort is finalized, so no fee is required to fill out the application form below.

Wait, so do we get a refund if we decide we don't want to do the course, or if we manage to complete the course?

Like, is it a refund in the "get your money back if you don't like it" sense, or is it an incentive against signing up and then not completing the course?

Nice post!

My key takeaway: "A system is aligned to human values if it tends to generate optimized-looking stuff which is aligned to human values."

I think this is useful progress. In particular, it's good to aim for the AI to produce some particular result in the world, rather than trying to make the AI have some goal - it grounds you in the thing you actually care about in the end.

I'd say the "... aligned to human values" part is still underspecified (and I think you at least partially agree):

  • "aligned": how does the ontology translation between the representation of the "generated optimized-looking stuff" and the representation of human values look like?
  • "human values"
    • I think your model of humans is too simplistic. E.g. at the very least it's lacking a distinction like the one between "ego-syntonic" and "voluntary" as in this post, though I'd probably want an even significantly more detailed model. Also, one might need different models for very smart and reflective people than for most people.
    • We haven't described value extrapolation.
      • (Or from an alternative perspective, our model of humans doesn't identify their relevant metapreferences (which probably no human knows fully explicitly, and which for some/many humans might not be really well defined).)

Positive reinforcement for first trying to better understand the problem before running off and trying to solve it! I think that's the way to make progress, and I'd encourage others to continue work on more precisely defining the problem, and in particular on getting better models of human cognition to identify how we might be able to rebind the "human values" concept to a better model of what's happening in human minds.

Btw, I'd have put the corrigibility section into a separate post; it's not nearly up to the standard of the rest of this post.

To set expectations: this post will not discuss ...

Maybe you want to add here that this is not meant to be an overview of alignment difficulties, or an explanation for why alignment is hard.

Agreed that people focus a bit too much on scheming. It might be good for some people to think a bit more about the other failure modes you described, but the main thing that needs doing is very smart people making progress towards building an aligned AI, not defending against particular failure modes. (However, most people probably cannot usefully contribute to that, so maybe focusing on failure modes is still good for most people. Though in any case there's the problem that people will find proposals that very likely don't actually work but which they can more readily believe will work, thereby making a stop to AI development a bit less likely.)

In general, I wish more people would make posts about books without feeling the need to do boring parts they are uninterested in (summarizing and reviewing) and more just discussing the ideas they found valuable. I think this would lower the friction for such posts, resulting in more of them. I often wind up finding such thoughts and comments about non-fiction works by LWers pretty valuable. I have more of these if people are interested.

I liked this post, thanks and positive reinforcement. In case you didn't already post your other book notes, just letting you know I'd be interested.

Do we have a sense for how much of the orca brain is specialized for sonar?

I don't know.

But evolution slides functions around on the cortical surface, and (Claude tells me) association areas like the prefrontal cortex are particularly prone to this.

It's particularly bad for cetaceans. Their functional mapping looks completely different.

Thanks. Yep, I agree with you; some elaboration:

(This comment assumes you at least read the basic summary of my project (or watched the intro video).)

I know of the Earth Species Project (ESP) and CETI (though I've only read two publications from ESP and none from CETI).

I don't expect them to succeed in something equivalent to decoding orca language to an extent that we could communicate with orcas almost as richly as they communicate among each other. (Though if long-range sperm whale signals are a lot simpler, they might be easier to decode.)

From what I've seen, they are mostly trying to throw AI at the data and hoping they will somehow understand something, without a clear plan for how to actually decode it. The AI techniques might look advanced, but they're sorta the obvious things to try, and I think they're unlikely to work very well, though I'm still glad they are trying.

If you look at orca vocalizations, they look complex and alien. The patterns we can currently recognize there look very different from what we'd be able to see in an unknown human language. The embedding mapping might be useful if we had to decode an unknown human language, and maybe we still learn some useful stuff from it, but for orca language we don't even know what their analog of words and sentences is, and maybe their language works somewhat differently (though I'd guess that if they are smarter than humans there's probably going to be something like words and sentences - but they might be encoded differently in the signals than in human languages).
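
To gesture at what I mean by "the embedding mapping" here: roughly the cross-lingual embedding alignment idea, where you learn a rotation between two languages' word-embedding spaces. Below is a minimal sketch of that general technique, my own illustration and not anyone's actual pipeline; it assumes you already have word embeddings for both languages plus a small seed dictionary, which is exactly what we don't have for orcas (all function names and data here are made up):

```python
import numpy as np

def fit_orthogonal_map(X_src: np.ndarray, Y_tgt: np.ndarray) -> np.ndarray:
    """Orthogonal Procrustes: find rotation W minimizing ||X_src @ W.T - Y_tgt||.

    X_src, Y_tgt: (n_pairs, dim) embeddings of seed-dictionary word pairs.
    """
    U, _, Vt = np.linalg.svd(Y_tgt.T @ X_src)
    return U @ Vt  # (dim, dim) map from source space into target space

def nearest_translation(word_vec: np.ndarray, W: np.ndarray,
                        tgt_vocab: dict[str, np.ndarray]) -> str:
    """Return the target-language word whose embedding is closest (cosine) to the mapped vector."""
    mapped = W @ word_vec
    mapped = mapped / np.linalg.norm(mapped)
    best_word, best_sim = None, -np.inf
    for word, vec in tgt_vocab.items():
        sim = float(mapped @ (vec / np.linalg.norm(vec)))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word
```

The point of the sketch is just that this kind of method presupposes you can already segment the signal into word-like units and embed them; for orca vocalizations we don't have that segmentation, which is one reason I don't expect it to transfer directly.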

It's definitely plausible that AI can help significantly with decoding animal languages, but I think it also requires forming a deep understanding of some things, and I think it's likely too hard for ESP to succeed anytime soon - though it's possible a supergenius could do it in a few years, it would be really impressive.

My approach may fail, especially if orcas aren't at least roughly human-level smart, but it has the advantage that we can show orcas precise context for what some words and sentences mean, whereas we basically have almost no context data on recordings of orca vocalizations. So it's easier for them to see what some of our signals mean than for humans to infer what orca vocalizations mean. (Even if we had a lot of video datasets with vocalizations (which we don't), that's still a lot less context information about what they are talking about than if they could show us images to indicate what they want to talk about.) Of course humans have more research experience and better tools for decoding signals, but it doesn't look to me like anyone is currently remotely close, and my approach is much quicker to try and might have at least a decent chance. (I mean, it worked to a nonzero extent with bottlenose dolphins (better than with great apes in terms of grammar), though I'd be a lot more ambitious.)

Of course, the language I create will also be alien for orcas, but I think if they are good enough at abstract pattern recognition they might still be able to learn it.

Answer by Towards_Keeperhood

Perhaps also not what you're looking for, but you could check out the Google Hash Code archive (here's an example problem). I never participated though, so I don't know whether they would make great tests. But it seems to me like general ad-hoc problem-solving capabilities are more useful in Hash Code than in other competitive programming competitions.

GPT4 summary: "Google Hash Code problems are real-world optimization and algorithmic challenges that require participants to design efficient solutions for large-scale scenarios. These problems are typically open-ended and focus on finding the best possible solution within given constraints, rather than exact correctness."

Answer by Towards_Keeperhood

Maybe not what you're looking for, because it's not one hard problem but more like many problems in a row, and I don't really know whether they are difficult enough, but you could (have someone) look into Exit games. Those are basically like escape rooms to go. I'd filter for Age 16+ to hopefully get the hard ones, though maybe you'd want to separately look up which are particularly hard.

I did one or two when I was like 15 or 16 years old, recently remembered them, and want to try some more for fun (and maybe also introspection), though I haven't gotten around to it yet. I think they are relatively ad-hoc puzzles, though as with basically anything, you can of course train to get good at Exit games in particular by practicing. (It's possible that I totally overestimate the difficulty and they are actually more boring than I expect.)

(Btw, probably even less applicable to what you are looking for, but CodingEscape is also really fun. The "Curse of the five warriors" in particular is good.)

I hope I will get around to rereading the post and editing this comment into a proper review, but I'm pretty busy, so in case I don't, I'm leaving this very shitty review here for now.

I think this is probably my favorite post from 2023. Read the post summary to see what it's about.

I don't remember a lot of the details from the post and so am not sure whether I agree with everything, but what I can say is:

  1. When I read it several months ago, it seemed to me like an amazingly good explanation for why and how humans fall for motivated reasoning.
  2. The concept of valence turned out to be very useful for explaining some of my thought processes. E.g. when I catch myself daydreaming something and ask myself why, in the few cases where I checked it was always something that falls under "the thought has high valence" - like imagining some situation where I said something that makes me look smart.

Another thought (though I don't actually have any experience with this): mostly doing attentive silent listening/observing might also be useful for learning how the other person does research.

Like, if it seems boring to just observe and occasionally say something, you could try to better predict how the person will think, or something like that.
