All of Ebenezer Dukakis's Comments + Replies

Re: "number go up" research tasks that seem automatable -- one idea I had is to use an LLM to process the entire LW archive, and identify alignment research ideas which could be done using the "number go up" approach (or seem otherwise amenable to automation).

"Proactive unlearning", in particular, strikes me as a quite promising research direction which could be automated. Especially if it is possible to "proactively unlearn" scheming. Gradient routing would be an example of the sort of approach I have in mind.

To elaborate: I think it is ideal to have an... (read more)

generating evidence of danger

Has this worked so far? How many cases can we point to where a person who was publicly skeptical of AI danger changed their public stance, as a result of seeing experimental evidence?

What's the theory of change for "generating evidence of danger"? The people who are most grim about AI would probably tell you that there is already plenty of evidence. How will adding more evidence to the pile help? Who will learn about this new evidence, and how will it cause them to behave differently?

Here's a theory of change I came up w... (read more)

These points have merit, but they also work for intelligent AI.

Technically the point of going to college is to help you thrive in the rest of your life after college. If you believe in AI 2027, the most important thing for the rest of your life is for AI to be developed responsibly. So, maybe work on that instead of college?

I think the EU could actually be a good place to protest for an AI pause. Because the EU doesn't have national AI ambitions, and is increasingly skeptical of the US, it seems to me that a bit of protesting could do a lot to raise awareness of the reckless path that the US is taking. That, ... (read more)

If people start losing jobs from automation, that could finally build political momentum for serious regulation.

Suggested in Zvi's comments the other month (22 likes):

The real problem here is that AI safety feels completely theoretical right now. Climate folks can at least point to hurricanes and wildfires (even if connecting those dots requires some fancy statistical footwork). But AI safety advocates are stuck making arguments about hypothetical future scenarios that sound like sci-fi to most people. It's hard to build political momentum around "trust

... (read more)
2yams
Rather than make things worse as a means of compelling others to make things better, I would rather just make things better. Brinksmanship and accelerationism (in the Marxist sense) are high variance strategies ill-suited to the stakes of this particular game. [one way this makes things worse is stimulating additional investment on the frontier; another is attracting public attention to the wrong problem, which will mostly just generate action on solutions to that problem, and not to the problem we care most about. Importantly, the contingent of people-mostly-worried-about-jobs are not yet our allies, and it’s likely their regulatory priorities would not address our concerns, even though I share in some of those concerns.]

I am optimistic that further thinking on automation prospects could identify other automation-tractable areas of alignment and control (e.g. see here for previous work).

This tag might be helpful: https://www.lesswrong.com/w/ai-assisted-alignment

Here's a recent shortform on the topic: https://www.lesswrong.com/posts/mKgbawbJBxEmQaLSJ/davekasten-s-shortform?commentId=32jReMrHDd5vkDBwt

I wonder about getting an LLM to process LW archive posts, and tag posts which contain alignment ideas that seem automatable.
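To make the shape of this concrete, here's a minimal sketch of what a tagging pass could look like, assuming the OpenAI Python client; the model name, the prompt wording, the YES/NO output convention, and the `archive_posts` variable in the usage note are all illustrative assumptions rather than a tested pipeline:

```python
# Hedged sketch: ask an LLM whether each archived post contains an alignment research idea
# that looks "number go up"-shaped or otherwise amenable to automation.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Does the following LessWrong post contain an alignment research idea that could be "
    "pursued by optimizing a clear quantitative metric (a 'number go up' approach), or that "
    "otherwise seems amenable to automation? Answer YES or NO on the first line, then give "
    "one sentence of justification.\n\n{post}"
)

def looks_automatable(post_text: str) -> bool:
    """Tag a single archived post."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": PROMPT.format(post=post_text[:20000])}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# Usage (hypothetical): tagged = [p for p in archive_posts if looks_automatable(p.body)]
```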

it will also set off the enemy rhetoric detectors among liberals

I'm not sure about that; does Bernie Sanders's rhetoric set off that detector?

I think the way the issue is framed matters a lot. If it's a "populist" framing ("elites are in it for themselves, they can't be trusted"), that frame seems to have resonated with a segment of the right lately. Climate change has a sanctimonious frame in American politics that conservatives hate.

2Seth Herd
Agreed, tone and framing are crucial. The populist framing might work for conservatives, but it will also set off the enemy rhetoric detectors among liberals. So coding it to either side is prone to backfire. Based on that logic, I'm leaning toward thinking that it needs to be framed to carefully avoid or walk the center line between the terms and framings of both sides. It would be just as bad to have it polarized as conservative, right? Although we've got four years of conservatism, so it might be worth thinking seriously about whether that trade might be worth it. I'm not sure a liberal administration would undo restrictions on AI even if they had been conservative-coded... Interesting. I'm feeling more like saying "the elites want to make AI that will make them rich while putting half the world out of a job". That's probably true as far as it goes, and it could be useful.

It looks like the comedian whose clip you linked has a podcast:

https://www.joshjohnsoncomedy.com/podcasts

I don't see any guests in their podcast history, but maybe someone could invite him on a different podcast? His website lists appearances on other podcasts. I figure it's worth trying stuff like this for VoI.

I think people should emphasize more the rate of improvement in this technology. Analogous to early days of COVID -- it's not where we are that's worrisome; it's where we're headed.

For humans acting very much not alone, like big AGI research companies, yeah that's clearly a big problem.

How about a group of superbabies that find and befriend each other? Then they're no longer acting alone.

I don't think the problem is about any of the people you listed having too much brainpower.

I don't think problems caused by superbabies would look distinctively like "having too much brainpower". They would look more like the ordinary problems humans have with each other. Brainpower would be a force multiplier.

(I feel we're somewhat talki

... (read more)

I mostly just want people to pay attention to this problem.

Ok. To be clear, I strongly agree with this. I think I've been responding to a claim (maybe explicit, or maybe implicit / imagined by me) from you like: "There's this risk, and therefore we should not do this.". Where I want to disagree with the implication, not the antecedent. (I hope to more gracefully agree with things like this. Also someone should make a LW post with a really catchy term for this implication / antecedent discourse thing, or link me the one that's already been written.)

But I... (read more)

I think this project should receive more red-teaming before it gets funded.

Naively, it would seem that the "second species argument" matches much more strongly to the creation of a hypothetical Homo supersapiens than it does to AGI.

We've observed many warning shots regarding catastrophic human misalignment. The human alignment problem isn't easy. And "intelligence" seems to be a key part of the human alignment picture. Humans often lack respect or compassion for other animals that they deem intellectually inferior -- e.g. arguing that because those othe... (read more)

1Harry Lee-Jones
May I offer an alternative view on intelligence and exploitation? I think universally, people exploit others to gain resources because of resource scarcity. As people become smarter, I concede they do become more capable of exploiting others for resources. However, smarter people are also more capable of generating resources by ethical means, and they are more aware of ethical considerations.  I expect that increasing intelligence will decrease global resource scarcity, which will decrease people's desperation to exploit others for resources. Therefore, even though increased intelligence enables more exploitation, it will decrease the amount of exploitation we see by removing the incentive that is resource scarcity. 
2Mr Beastly
Yes, and... "Would be interesting to see this research continue in animals.  E.g.  Provide evidence that they've made a "150 IQ" mouse or dog. What would a dog that's 50% smarter than the average dog behave like? or 500% smarter?  Would a dog that's 10000% smarter than the average dog be able to learn, understand and "speak" in human languages?" -- From this comment
-2NickH
If your world view requires valuing the ethics of (current) people of lower IQ over those of (future) people of higher IQ, then you have a much bigger problem than AI alignment. Whatever IQ is, it is strongly correlated with success, which implies a genetic drive towards higher IQ, so your feared future is coming anyway (unless AI ends us first), and there is nothing we can logically do to have any long-term influence on the ethics of smarter people coming after us.

Right now only low-E tier human intelligences are being discussed, they'll be able to procreate with humans and be a minority.

Considering current human distributions, and the fact that 160+ IQ people haven't written off sub-100 IQ populations as morally useless, I doubt a new sub-population at 200+ is going to suddenly turn on humanity.

If you go straight to 1000 IQ or something, then sure, we might be like animals compared to them.

You shouldn't and won't be satisfied with this alone, as it doesn't deal with or even emphasize any particular peril; but to be clear, I have definitely thought about the perils: https://berkeleygenomics.org/articles/Potential_perils_of_germline_genomic_engineering.html

There's a good chance their carbon children would have about the same attitude towards AI development as they do. So I suspect you'd end up ruled by their silicon grandchildren.

5lemonhope
Good point! I didn't think that far ahead

These are incredibly small peanuts compared to AGI omnicide.

The jailbreakability and other alignment failures of current AI systems are also incredibly small peanuts compared to AGI omnicide. Yet they're still informative. Small-scale failures give us data about possible large-scale failures.

You're somehow leaving out all the people who are smarter than those people, and who were great for the people around them and humanity? You've got like 99% actually alignment or something

Are you thinking of people such as Sam Altman, Demis Hassabis, Elon Musk... (read more)

3TsviBT
But you don't go from a 160 IQ person with a lot of disagreeability and ambition, who ends up being a big commercial player or whatnot, to 195 IQ and suddenly get someone who just sits in their room for a decade and then speaks gibberish into a youtube livestream and everyone dies, or whatever. The large-scale failures aren't feasible for humans acting alone. For humans acting very much not alone, like big AGI research companies, yeah that's clearly a big problem. But I don't think the problem is about any of the people you listed having too much brainpower. (I feel we're somewhat talking past each other, but I appreciate the conversation and still want to get where you're coming from.)

Humans are very far from fooming.

Tell that to all the other species that went extinct as a result of our activity on this planet?

I think it's possible that the first superbaby will be aligned, same way it's possible that the first AGI will be aligned. But it's far from a sure thing. It's true that the alignment problem is considerably different in character for humans vs AIs. Yet even in this particular community, it's far from solved -- consider Brent Dill, Ziz, Sam Bankman-Fried, etc.

Not to mention all of history's great villains, many of whom beli... (read more)

4Noosphere89
I will go further, and say the human universals are nowhere near strong enough to assume that alignment of much more powerful people will automatically/likely happen, or that not aligning them produces benevolent results, and the reason for this is humans are already misaligned, in many cases very severely to each other, so allowing human augmentation without institutional reform makes things a lot worse by default. It is better to solve the AI alignment problem first, then have a legal structure created by AIs that can make human genetic editing safe, rather than try to solve the human alignment problem: https://www.lesswrong.com/posts/DfrSZaf3JC8vJdbZL/how-to-make-superbabies#jgDtAPXwSucQhPBwf

Tell that to all the other species that went extinct as a result of our activity on this planet?

Individual humans.

Brent Dill, Ziz, Sam Bankman-Fried, etc.

  1. These are incredibly small peanuts compared to AGI omnicide.
  2. You're somehow leaving out all the people who are smarter than those people, and who were great for the people around them and humanity? You've got like 99% actually alignment or something, and you're like "But there's some chance it'll go somewhat bad!"... Which, yes, we should think about this, and prepare and plan and prevent, but it's just a totally totally different calculus from AGI.

Altman and Musk are arguably already misaligned relative to humanity's best interests. Why would you expect smarter versions of them to be more aligned? That only makes sense if we're in an "alignment by default" world for superbabies, which is far from obvious.

5lemonhope
I would vote to be ruled by their carbon children instead of their silicon children for certain

If you look at the grim history of how humans have treated each other on this planet, I don't think it's justified to have a prior that this is gonna go well.

I think we have a huge advantage with humans simply because there isn't the same potential for runaway self-improvement.

Humans didn't have the potential for runaway self-improvement relative to apes. That was little comfort for the apes.

This is starting to sound a lot like AI actually. There's a "capabilities problem" which is easy, an "alignment problem" which is hard, and people are charging ahead to work on capabilities while saying "gee, we'd really like to look into alignment at some point".

It's utterly different.

  • Humans are very far from fooming.
    • Fixed skull size; no in silico simulator.
    • Highly dependent on childhood care.
    • Highly dependent on culturally transmitted info, including in-person.
  • Humans, genomically engineered or not, come with all the stuff that makes humans human. Fear, love, care, empathy, guilt, language, etc. (It should be banned, though, to remove any human universals, though defining that seems tricky.) So new humans are close to us in values-space, and come with the sort of corrigibility that humans have, which is, yo
... (read more)

Can anyone think of alignment-pilled conservative influencers besides Geoffrey Miller? Seems like we could use more people like that...

Maybe we could get alignment-pilled conservatives to start pitching stories to conservative publications?

Likely true, but I also notice there's been a surprising amount of drift of political opinions from the left to the right in recent years. The right tends to put their own spin on these beliefs, but I suspect many are highly influenced by the left nonetheless.

Some examples of right-coded beliefs which I suspect are, to some degree, left-inspired:

  • "Capitalism undermines social cohesion. Consumerization and commoditization are bad. We're a nation, not an economy."

  • "Trans women undermine women's rights and women's spaces. Motherhood, and women's digni

... (read more)
3Seth Herd
That is a great point and your examples are fascinating! I think polarization is still quite possible and should be avoided at high cost. If AI safety becomes the new climate change, it seems pretty clear that it will create conflict in public opinion and deadlock in politics.

I think the National Review is the most prestigious conservative magazine in the US, but there are various others. City Journal articles have also struck me as high-quality in the past. I think Coleman Hughes writes for them, and he did a podcast with Eliezer Yudkowsky at one point.

However, as stated in the previous link, you should likely work your way up and start by pitching lower-profile publications.

The big one probably has to do with being able to corrupt the metrics so totally that whatever you think you made them unlearn actually didn't happen, or just being able to relearn the knowledge so fast that unlearning doesn't matter

I favor proactive approaches to unlearning which prevent the target knowledge from being acquired in the first place. E.g. for gradient routing, if you can restrict "self-awareness and knowledge of how to corrupt metrics" to a particular submodule of the network during learning, then if that submodule isn't active, you can ... (read more)
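For intuition, here's a minimal PyTorch sketch of the gradient-masking idea behind approaches like this. It's my own illustrative simplification rather than the published gradient routing implementation; the module layout, the detach-based routing, and the notion of "flagged" data are all assumptions:

```python
# Illustrative simplification of gradient routing: route the gradients from "flagged" examples
# (e.g. data that teaches an unwanted capability) into one designated submodule, so that
# submodule can later be ablated or simply left inactive at deployment.
import torch
import torch.nn as nn

class RoutedBlock(nn.Module):
    def __init__(self, d: int = 64):
        super().__init__()
        self.shared = nn.Linear(d, d)   # ordinary capacity
        self.target = nn.Linear(d, d)   # where the unwanted knowledge should localize
        self.head = nn.Linear(d, d)

    def forward(self, x: torch.Tensor, flagged: bool) -> torch.Tensor:
        h = torch.relu(self.shared(x))
        if flagged:
            h_in = h.detach()                   # block gradients into `shared` for flagged data
            out = h_in + torch.relu(self.target(h_in))
        else:
            out = h + torch.relu(self.target(h)).detach()   # ordinary data never updates `target`
        return self.head(out)

# At deployment (or before evaluation), zeroing out the target submodule's contribution is the
# "proactive unlearning" step: the flagged knowledge was never routed into the rest of the net.
```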

3Noosphere89
I basically agree with this, and on this question: My big areas of excitement are AI control (in a broad sense) and synthetic dataset making for AI alignment of successors.

Regarding articles which target a popular audience such as How AI Takeover Might Happen in 2 Years, I get the sense that people are preaching to the choir by posting here and on X. Is there any reason people aren't pitching pieces like this to prestige magazines like The Atlantic or wherever else? I feel like publishing in places like that is a better way to shift the elite discourse, assuming that's the objective. (Perhaps it's best to pitch to publications that people in the Trump admin read?)

Here's an article on pitching that I found on the EA Forum ... (read more)

2joshc
Seems like a reasonable idea. I'm not in touch enough with popular media to know:
- Which magazines are best to publish this kind of thing if I don't want to contribute to political polarization
- Which magazines would possibly post speculative fiction like this (I suspect most 'prestige magazines' would not)
If you have takes on this I'd love to hear them!
4Seth Herd
I do think that pitching publicly is important. If the issue is picked up by liberal media, it will do more harm than good with conservatives and the current administration. Avoiding polarization is probably even more important than spreading public awareness. That depends on your theory of change, but you should have one, carefully thought out, to guide publicity efforts.

I think unlearning could be a good fit for automated alignment research.

Unlearning could be a very general tool to address a lot of AI threat models. It might be possible to unlearn deception, scheming, manipulation of humans, cybersecurity, etc. I challenge you to come up with an AI safety failure story that can't, in principle, be countered through targeted unlearning in some way, shape, or form.

Relative to some other kinds of alignment research, unlearning seems easy to automate, since you can optimize metrics for how well things have been unlearned.

I... (read more)
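As a sketch of the kind of metric gestured at above (my own illustrative framing; the data loaders, the pre-unlearning baseline losses, and the retain-penalty weight are all assumptions):

```python
# Illustrative "number go up" unlearning metric: loss on a forget set should rise relative to a
# pre-unlearning baseline, while loss on a retain set should stay roughly where it was.
import torch

@torch.no_grad()
def mean_loss(model, loader, loss_fn) -> float:
    total, n = 0.0, 0
    for x, y in loader:
        total += loss_fn(model(x), y).item() * len(y)
        n += len(y)
    return total / n

def unlearning_score(model, forget_loader, retain_loader, loss_fn,
                     base_forget: float, base_retain: float, alpha: float = 10.0) -> float:
    """Higher is better: the targeted knowledge got harder, everything else stayed intact."""
    forget_gap = mean_loss(model, forget_loader, loss_fn) - base_forget   # want this large
    retain_gap = mean_loss(model, retain_loader, loss_fn) - base_retain   # want this near zero
    return forget_gap - alpha * max(retain_gap, 0.0)
```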

2Noosphere89
The big one probably has to do with being able to corrupt the metrics so totally that whatever you think you made them unlearn actually didn't happen, or just being able to relearn the knowledge so fast that unlearning doesn't matter, but yes unlearning is a very underrated direction for AI automation, because it targets so many threat models. It also satisfies the property of addressing a bottleneck (in this case, capabilities being so dangerous as to threaten any test), and while I wouldn't call it the best, it's still quite underrated how much unlearning will be useful. Similarly, domain-limiting AIs would be quite useful for control of AI.

China has alienated virtually all its neighbours

That sounds like an exaggeration? My impression is that China has OK/good relations with countries such as Vietnam, Cambodia, Pakistan, Indonesia, North Korea, factions in Myanmar. And Russia, of course. If you're serious about this claim, I think you should look at a map, make a list of countries which qualify as "neighbors" based purely on geographic distance, then look up relations for each one.

What I think is more likely than EA pivoting is a handful of people launch a lifeboat and recreate a high integrity version of EA.

Thoughts on how this might be done:

  • Interview a bunch of people who became disillusioned. Try to identify common complaints.

  • For each common complaint, research organizational psychology, history of high-performing organizations, etc. and brainstorm institutional solutions to address that complaint. By "institutional solutions", I mean approaches which claim to e.g. fix an underlying bad incentive structure, so it won't

... (read more)

The possibility for the society-like effect of multiple power centres creating prosocial incentives on the projects

OpenAI behaves in a generally antisocial way, inconsistent with its charter, yet other power centers haven't reined it in. Even in the EA and rationalist communities, people don't seem to have asked questions like "Is the charter legally enforceable? Should people besides Elon Musk be suing?"

If an idea is failing in practice, it seems a bit pointless to discuss whether it will work in theory.

One idea is to use a base LLM with no RLHF, compute the perplexity of the reasoning text, and add it as an additional term in the loss function. That should help with comprehensibility, but it doesn't necessarily help with steganography. To disincentivize steganography, you could add noise to the reasoning in various ways, and remove any incentive for terseness, to ensure the model isn't trying to squeeze more communication into a limited token budget.
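Here's a minimal sketch of that first idea, framed as a reward penalty for an RL-style setup rather than a directly differentiable loss term (assumes Hugging Face transformers; GPT-2 is a stand-in for "a base LLM with no RLHF", and the penalty weight is arbitrary):

```python
# Penalize reasoning text that a base (non-RLHF) LM finds surprising, nudging the policy
# toward chains of thought that stay in natural, human-readable language.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("gpt2")                 # stand-in base model
base_lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def base_log_perplexity(reasoning_text: str) -> float:
    ids = base_tok(reasoning_text, return_tensors="pt").input_ids
    return base_lm(ids, labels=ids).loss.item()                  # mean NLL per token

def shaped_reward(task_reward: float, reasoning_text: str, lam: float = 0.1) -> float:
    # Subtract the comprehensibility penalty from whatever the task reward is.
    return task_reward - lam * base_log_perplexity(reasoning_text)
```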

A basic idea for detecting steganography is to monitor next-token probabilities for synonym pairs. If t... (read more)
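A rough sketch of what that probe might look like (illustrative only: the synonym list is a placeholder, the pairs are assumed to be single tokens, and a real check would aggregate statistics over a large corpus of reasoning traces):

```python
# For each position where the model emitted one member of a synonym pair, record the logit gap
# between the two synonyms under the distribution that produced that token. Innocent word choice
# should give unremarkable gaps; a model encoding hidden bits in word choice would skew them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

SYNONYM_PAIRS = [(" big", " large"), (" maybe", " perhaps")]   # placeholder single-token pairs

@torch.no_grad()
def synonym_logit_gaps(text: str) -> list[float]:
    ids = tok(text, return_tensors="pt").input_ids
    logits = lm(ids).logits[0]                                  # (seq_len, vocab)
    gaps = []
    for a, b in SYNONYM_PAIRS:
        id_a, id_b = tok(a).input_ids[0], tok(b).input_ids[0]
        for pos in (ids[0] == id_a).nonzero().flatten():
            if pos == 0:
                continue                                        # no prior distribution for token 0
            prior = logits[pos - 1]                             # distribution that chose this token
            gaps.append((prior[id_a] - prior[id_b]).item())
    return gaps   # flag reasoning whose gap distribution looks systematically extreme
```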

If that's true, perhaps the performance penalty for pinning/freezing weights in the 'internals', prior to the post-training, would be low. That could ease interpretability a lot, if you didn't need to worry so much about those internals which weren't affected by post-training?

2Bogdan Ionut Cirstea
Yes. Also, if the "LMs after pretraining as Simulators" model is right (I think it is), it should also help a lot with safety in general, because the simulator should be quite malleable, even if some of the simulacra might be malign. As long as you can elicit the malign simulacra, you can also apply interp to them or do things in the style of Interpreting the Learning of Deceit for post-training. This could also help a lot with e.g. coup probes and other similar probes for monitoring.

On LessWrong, there's a comment section where hard questions can be asked and are asked frequently.

In my experience, asking hard questions here is quite socially unrewarding. I could probably think of a dozen or so cases where I think the LW consensus "emperor" has no clothes, that I haven't posted about, just because I expect it to be an exercise in frustration. I think I will probably quit posting here soon.

I don't think AI policy is a good example for discourse on LessWrong. There are strategic reasons to be less transparent about how to affect p

... (read more)

"The far left is censorious" and "Republicans are censorious" are in no way incompatible claims :-)

Great post. Self-selection seems huge for online communities, and I think it's no different on these fora.

Confidence level: General vague impressions and assorted thoughts follow; could very well be wrong on some details.

A disagreement I have with both the rationalist and EA communities is what the process of coming to robust conclusions looks like. In those communities, it seems like the strategy is often to identify a few super-geniuses who go do a super-deep analysis, and come to a conclusion that's assumed to be robust and trustworthy. See the "Grou... (read more)

4ChristianKl
On LessWrong, there's a comment section where hard questions can be asked and are asked frequently. The same is true on ACX. On the other hand, GiveWell recommendations don't allow raising hard questions in the same way and most of the grant decisions are made behind closed doors. I don't think AI policy is a good example for discourse on LessWrong. There are strategic reasons to be less transparent about how to affect public policy than for most other topics. Everything that's written publicly can be easily picked up by journalists wanting to write stories about AI. I think you can argue that more reasoning transparency around AI policy would be good, but it's not something that generalizes over other topics on LessWrong.

Yeah, I think there are a lot of underexplored ideas along these lines.

It's weird how so much of the internet seems locked into either the reddit model (upvotes/downvotes) or the Twitter model (likes/shares/followers), when the design space is so much larger than that. Someone like Aaron, who played such a big role in shaping the internet, seems more likely to have a gut-level belief that it can be shaped. I expect there are a lot more things like Community Notes that we could discover if we went looking for them.

I've always wondered what Aaron Swartz would think of the internet now, if he was still alive. He had far-left politics, but also seemed to be a big believer in openness, free speech, crowdsourcing, etc. When he was alive those were very compatible positions, and Aaron was practically the poster child for holding both of them. Nowadays the far left favors speech restrictions and is cynical about the internet.

Would Aaron have abandoned the far left, now that they are censorious? Would he have become censorious himself? Or would he have invented some clever new technology, like RSS or reddit, to try and fix the internet's problems?

Just goes to show what a tragedy death is, I guess.

-2sapphire
Most of the Republican party wants to outright ban pornography. Full ban in Project 2025. And they consider lots of LGBT representation fundamentally pornographic.

I imagine he would have tried to do things like help people see separations in who has what opinion, so as to avoid people thinking of an entire group as uniformly exhibiting a single deleterious behavior; this might, for example, help amplify the parts of the group that do not exhibit that behavior, so that the things that that group has to offer which are good by the lights of humanity are not lost due to a negative behavior.

I generally find the left has a lot of useful and interesting things to say, if you're able to get people to share their beliefs in... (read more)

I expect escape will happen a bunch

Are you willing to name a specific year/OOM such that if there are no publicly known cases of escape by that year/OOM, you would be surprised? What, if anything, would you acknowledge as evidence that alignment is easier than you thought, here?

To ensure the definition of "escape" is not gerrymandered -- do you know of any cases of escape right now? Do you think escape has already occurred and you just don't know about it? "Escape" means something qualitatively different from any known event up to this point, yes? D... (read more)

Sure there will be errors, but how important will those errors be?

Humans currently control the trajectory of humanity, and humans are error-prone. If you replace humans with something that's error-prone in similar ways, that doesn't seem like it's obviously either a gain or a loss. How would such a system compare to an em of a human, for example?

If you want to show that we're truly doomed, I think you need additional steps beyond just "there will be errors".

5RogerDearnaley
My thesis above is that, at AGI level, the combination of human-like capabilities (except perhaps higher speed, or more encyclopedic knowledge) and making human-like errors in alignment is probably copable with, by mechanisms and techniques comparable to things like law enforcement that we use for humans — but that at ASI level it's likely to be x-risk disastrous, just like most human autocrats are. (I assume that this observation is similar to the concerns others have raised about "sharp left turns" — personally I find the simile with human autocrats more illuminating than a metaphor about an out-of-control vehicle.) So IMO AGI is the last level at which we can afford to be still working the bugs out of/converging to alignment.

Some recent-ish bird flu coverage:

Global health leader critiques ‘ineptitude’ of U.S. response to bird flu outbreak among cows

A Bird-Flu Pandemic in People? Here’s What It Might Look Like. TLDR: not good. (Reload the page and ctrl-a then ctrl-c to copy the article text before the paywall comes up.) Interesting quote: "The real danger, Dr. Lowen of Emory said, is if a farmworker becomes infected with both H5N1 and a seasonal flu virus. Flu viruses are adept at swapping genes, so a co-infection would give H5N1 opportunity to gain genes that enable it to s... (read more)

About a month ago, I wrote a quick take suggesting that an early messaging mistake made by MIRI was: claim there should be a single leading FAI org, but not give specific criteria for selecting that org. That could've led to a situation where Deepmind, OpenAI, and Anthropic can all think of themselves as "the best leading FAI org".

An analogous possible mistake that's currently being made: Claim that we should "shut it all down", and also claim that it would be a tragedy if humanity never created AI, but not give specific criteria for when it would be app... (read more)

3Vaniver
I think Six Dimensions of Operational Adequacy was in this direction; I wish we had been more willing to, like, issue scorecards earlier (like publishing that document in 2017 instead of 2022). The most recent scorecard-ish thing was commentary on the AI Safety Summit responses. I also have the sense that the time to talk about unpausing is while creating the pause; this is why I generally am in favor of things like RSPs and RDPs. (I think others think that this is a bit premature / too easy to capture, and we are more likely to get a real pause by targeting a halt.)

Don’t have time to respond in detail but a few quick clarifications/responses:

Sure, don't feel obligated to respond, and I invite the people disagree-voting my comments to hop in as well.

— There are lots of groups focused on comms/governance. MIRI is unique only insofar as it started off as a “technical research org” and has recently pivoted more toward comms/governance.

That's fair, when you said "pretty much any other organization in the space" I was thinking of technical orgs.

MIRI's uniqueness does seem to suggest it has a comparative advantage fo... (read more)

I think if MIRI engages with “curious newcomers” those newcomers will have their own questions/confusions/objections and engaging with those will improve general understanding.

You think policymakers will ask the sort of questions that lead to a solution for alignment?

In my mind, the most plausible way "improve general understanding" can advance the research frontier for alignment is if you're improving the general understanding of people fairly near that frontier.

Based on my experience so far, I don’t expect their questions/confusions/objections to ov

... (read more)
4Orpheus16
Don’t have time to respond in detail but a few quick clarifications/responses:
— I expect policymakers to have the most relevant/important questions about policy and to be the target audience most relevant for enacting policies. Not solving technical alignment. (Though I do suspect that by MIRI’s lights, getting policymakers to understand alignment issues would be more likely to result in alignment progress than having more conversations with people in the technical alignment space.)
— There are lots of groups focused on comms/governance. MIRI is unique only insofar as it started off as a “technical research org” and has recently pivoted more toward comms/governance.
— I do agree that MIRI has had relatively low output for a group of its size/resources/intellectual caliber. I would love to see more output from MIRI in general. Insofar as it is constrained, I think they should be prioritizing “curious policy newcomers” over people like Matthew and Alex.
— Minor but I don’t think MIRI is getting “outargued” by those individuals and I think that frame is a bit too zero-sum.
— Controlling for overall level of output, I suspect I’m more excited than you about MIRI spending less time on LW and more time on comms/policy work with policy communities (EG Malo contributing to the Schumer insight forums, MIRI responding to government RFCs).
— My guess is we both agree that MIRI could be doing more on both fronts and just generally having higher output. My impression is they are working on this and have been focusing on hiring; I think if their output stayed relatively the same 3-6 months from now I will be fairly disappointed.

So what's the path by which our "general understanding of the situation" is supposed to improve? There's little point in delaying timelines by a year, if no useful alignment research is done in that year. The overall goal should be to maximize the product of timeline delay and rate of alignment insights.

Also, I think you may be underestimating the ability of newcomers to notice that MIRI tends to ignore its strongest critics. See also previously linked comment.

I think if MIRI engages with “curious newcomers” those newcomers will have their own questions/confusions/objections and engaging with those will improve general understanding.

Based on my experience so far, I don’t expect their questions/confusions/objections to overlap a lot with the questions/confusions/objections that tech-oriented active LW users have.

I also think it’s not accurate to say that MIRI tends to ignore its strongest critics; there’s perhaps more public writing/dialogues between MIRI and its critics than for pretty much any other organizatio... (read more)

In terms of "improve the world's general understanding of the situation", I encourage MIRI to engage more with informed skeptics. Our best hope is if there is a flaw in MIRI's argument for doom somewhere. I would guess that e.g. Matthew Barnett he has spent something like 100x as much effort engaging with MIRI as MIRI has spent engaging with him, at least publicly. He seems unusually persistent -- I suspect many people are giving up, or gave up long ago. I certainly feel quite cynical about whether I should even bother writing a comment like this one.

9Orpheus16
Offering a quick two cents: I think MIRI‘s priority should be to engage with “curious and important newcomers” (e.g., policymakers and national security people who do not yet have strong cached views on AI/AIS). If there’s extra capacity and interest, I think engaging with informed skeptics is also useful (EG big fan of the MIRI dialogues), but on the margin I don’t suspect it will be as useful as the discussions with “curious and important newcomers.”

superbabies

I'm concerned there may be an alignment problem for superbabies.

Humans often have contempt for people and animals with less intelligence than them. "You're dumb" is practically an all-purpose putdown. We seem to assign moral value to various species on the basis of intelligence rather than their capacity for joy/suffering. We put chimpanzees in zoos and chickens in factory farms.

Additionally, jealousy/"xenophobia" towards superbabies from vanilla humans could lead them to become misanthropes. Everyone knows genetic enhancement is a radioa... (read more)

Also a strategy postmortem on the decision to pivot to technical research in 2013: https://intelligence.org/2013/04/13/miris-strategy-for-2013/

I do wonder about the counterfactual where MIRI never sold the Singularity Summit, and it was blowing up as an annual event, same way Less Wrong blew up as a place to discuss AI. Seems like owning the Summit could create a lot of leverage for advocacy.

One thing I find fascinating is the number of times MIRI has reinvented themselves as an organization over the decades. People often forget that they were originally... (read more)

I appreciate your replies. I had some more time to think and now I have more takes. This isn't my area, but I'm having fun thinking about it.

See https://en.wikipedia.org/wiki/File:ComputerMemoryHierarchy.svg

  • Disk encryption is table stakes. I'll assume any virtual memory is also encrypted. I don't know much about that.

  • I'm assuming no use of flash memory.

  • Absent homomorphic encryption, we have to decrypt in the registers, or whatever they're called for a GPU.

So basically the question is how valuable is it to encrypt the weights in RAM and poss... (read more)

2anithite
TLDR: Memory encryption alone is indeed not enough. Modifications and rollback must be prevented too.

* Memory encryption and authentication have come a long way.
* Unless there's a massive shift in ML architectures to doing lots of tiny reads/writes, overheads will be tiny. I'd guesstimate the following:
  * negligible performance drop / chip area increase
  * ~1% of DRAM and cache space[1]

It's hard to build hardware or datacenters that resist sabotage if you don't do this. You end up having to trust the maintenance people aren't messing with the equipment and the factories haven't added any surprises to the PCBs. With the right security hardware, you trust TSMC and their immediate suppliers and no one else.

Not sure if we have the technical competence to pull it off. Apple's likely one of the few that's even close to secure and it took them more than a decade of expensive lessons to get there. Still, we should put in the effort. Agreed that alignment is going to be the harder problem. Considering the amount of fail when it comes to building correct security hardware that operates using known principles ... things aren't looking great.

</TLDR> rest of comment is just details

Morphable Counters: Enabling Compact Integrity Trees For Low-Overhead Secure Memories

Performance cost

Overheads are usually quite low for CPU workloads:
* <1% extra DRAM required[1]
* <<10% execution time increase

Executable code can be protected with negligible overhead by increasing the size of the rewritable authenticated blocks for a given counter to 4KB or more. Overhead is then comparable to the page table. For typical ML workloads, the smallest data block is already 2x larger (GPU cache lines 128 bytes vs 64 bytes on CPU gives 2x reduction). Access patterns should be nice too, large contiguous reads/writes. Only some unusual workloads see significant slowdown (EG: large graph traversal/modification) but this can be on the order of 3x.[2]

A real example (intel SG

QFT is the extreme example of a "better abstraction", but in principle (if the natural abstraction hypothesis fails) there will be all sorts and shapes of abstractions, and some of them will be available to us, and some of them will be available to the model, and these sets will not fully overlap—which is a concern in worlds where different abstractions lead to different generalization properties.

Indeed. I think the key thing for me is, I expect the model to be strongly incentivized to have a solid translation layer from its internal ontology to e.g. E... (read more)

If I encountered an intelligent extraterrestrial species, in principle I think I could learn to predict fairly well things like what it finds easy to understand, what its values are, and what it considers to be ethical behavior, without using any of the cognitive machinery I use to self-reflect. Humans tend to reason about other humans by asking "what would I think if I was in their situation", but in principle an AI doesn't have to work that way. But perhaps you think there are strong reasons why this would happen in practice?

Supposing we had strong rea... (read more)

If I were to guess, I'd guess that by "you" you're referring to someone or something outside of the model, who has access to the model's internals, and who uses that access to, as you say, "read" the next token out of the model's ontology.

Was using a metaphorical "you". Probably should've said something like "gradient descent will find a way to read the next token out of the QFT-based simulation".

Yes, there are certainly applications where the training regime produces IID data, but next-token prediction is pretty clearly not one of those?

I suppose ... (read more)

2dxu
(Just to be clear: yes, I know what training and test sets are, as well as dev sets/validation sets. You might notice I actually used the phrase "validation set" in my earlier reply to you, so it's not a matter of guessing someone's password—I'm quite familiar with these concepts, as someone who's implemented ML models myself.) Generally speaking, training, validation, and test datasets are all sourced the same way—in fact, sometimes they're literally sourced from the same dataset, and the delineation between train/dev/test is introduced during training itself, by arbitrarily carving up the original dataset into smaller sets of appropriate size. This may capture the idea of "IID" you seem to appeal to elsewhere in your comment—that it's possible to test the model's generalization performance on some held-out subset of data from the same source(s) it was trained on. In ML terms, what the thought experiment points to is a form of underlying distributional shift, one that isn't (and can't be) captured by "IID" validation or test datasets. The QFT model in particular highlights the extent to which your training process, however broad or inclusive from a parochial human standpoint, contains many incidental distributional correlates to your training signal which (1) exist in all of your data, including any you might hope to rely on to validate your model's generalization performance, and (2) cease to correlate off-distribution, during deployment. This can be caused by what you call "omniscience", but it need not; there are other, more plausible distributional differences that might be picked up on by other kinds of models. But QFT is (as far as our current understanding of physics goes) very close to the base ontology of our universe, and so what is inferrable using QFT is naturally going to be very different from what is inferrable using some other (less powerful) ontology. QFT is a very powerful ontology! If you want to call that "omniscience", you can, although not

Because the human isn't going to constantly be present for everything the system does after it's deployed (unless for some reason it's not deployed).

I think it ought to be possible for someone to always be present. [I'm also not sure it would be necessary.]

So we need not assume that predicting "the genius philosopher" is a core task.

It's not the genius philosopher that's the core task, it's the reading of their opinions out of a QFT-based simulation of them. As I understand this thought experiment, we're doing next-token prediction on e.g. a book ... (read more)

2dxu
I think I don't understand what you're imagining here. Are you imagining a human manually overseeing all outputs of something like ChatGPT, or Microsoft Copilot, before those outputs are sent to the end user (or, worse yet, put directly into production)? [I also think I don't understand why you make the bracketed claim you do, but perhaps hashing that out isn't a conversational priority.] It sounds like your understanding of the thought experiment differs from mine. If I were to guess, I'd guess that by "you" you're referring to someone or something outside of the model, who has access to the model's internals, and who uses that access to, as you say, "read" the next token out of the model's ontology. However, this is not the setup we're in with respect to actual models (with the exception perhaps of some fairly limited experiments in mechanistic interpretability)—and it's also not the setup of the thought experiment, which (after all) is about precisely what happens when you can't read things out of the model's internal ontology, because it's too alien to be interpreted. In other words: "you" don't read the next token out of the QFT simulation. The model is responsible for doing that translation work. How do we get it to do that, even though we don't know how to specify the nature of the translation work, much less do it ourselves? Well, simple: in cases where we have access to the ground truth of the next token, e.g. because we're having it predict an existing book passage, we simply penalize it whenever its output fails to match the next token in the book. In this way, the model can be incentivized to correctly predict whatever we want it to predict, even if we wouldn't know how to tell it explicitly to do whatever it's doing. (The nature of this relationship—whereby humans train opaque algorithms to do things they wouldn't themselves be able to write out as pseudocode—is arguably the essence of modern deep learning in toto.) Yes, this is a reasonable descri

I'm confused about what it means to "remove the human", and why it's so important whether the human is 'removed'. Maybe if I try to nail down more parameters of the hypothetical, that will help with my confusion. For the sake of argument, can I assume...

  • That the AI is running computations involving quantum fields because it found that was the most effective way to make e.g. next-token predictions on its training set?

  • That the AI is in principle capable of running computations involving quantum fields to represent a genius philosopher?

If I can ass... (read more)

4dxu
Because the human isn't going to constantly be present for everything the system does after it's deployed (unless for some reason it's not deployed). Quantum fields are useful for an endless variety of things, from modeling genius philosophers to predicting lottery numbers. If your next-token prediction task involves any physically instantiated system, a model that uses QFT will be able to predict that system's time-evolution with alacrity. (Yes, this is computationally intractable, but we're already in full-on hypothetical land with the QFT-based model to begin with. Remember, this is an exercise in showing what happens in the worst-case scenario for alignment, where the model's native ontology completely diverges from our own.) So we need not assume that predicting "the genius philosopher" is a core task. It's enough to assume that the model is capable of it, among other things—which a QFT-based model certainly would be. Which, not so coincidentally, brings us to your next question: Consider how, during training, the human overseer (or genius philosopher, if you prefer) would have been pointed out to the model. We don't have reliable access to its internal world-model, and even if we did we'd see blobs of amplitude and not much else. There's no means, in that setting, of picking out the human and telling the model to unambiguously defer to that human. What must happen instead, then, is something like next-token prediction: we perform gradient descent (or some other optimization method; it doesn't really matter for the purposes of our story) on the model's outputs, rewarding it when its outputs happen to match those of the human. The hope is that this will lead, in the limit, to the matching no longer occurring by happenstance—that if we train for long enough and in a varied enough set of situations, the best way for the model to produce outputs that track those of the human is to model that human, even in its QFT ontology. But do we know for a fact that this

I think this quantum fields example is perhaps not all that forceful, because in your OP you state

maybe a faithful and robust translation would be so long in the system’s “internal language” that the translation wouldn’t fit in the system

However, it sounds like you're describing a system where we represent humans using quantum fields as a routine matter, so fitting the translation into the system isn't sounding like a huge problem? Like, if I want to know the answer to some moral dilemma, I can simulate my favorite philosopher at the level of quantum ... (read more)
