The way I'm using "sensitivity": sensitivity to X = the meaningfulness of X spurs responsive caring action.
I'm fine with that, although it seems important to have a separate term for the more limited sense of sensitivity so we can keep track of that distinction: maybe adaptability?
...One of the main concerns of the discourse of aligning AI can also be phrased as issues with internalization: specifically, that of internalizing human values. That is, an AI’s use of the word “yesterday” or “love” might only weakly refer to the concepts you mean
I agree with you that there are a lot of interesting ideas here, but I would like to see the core arguments laid out more clearly.
Lots of interesting ideas here, but the connection to alignment still seems a bit vague.
Is misalignment really a lack of sensitivity, as opposed to a difference in goals or values? It seems to me that an unaligned ASI is extremely sensitive to context, just in the service of its own goals.
Then again, maybe you see Live Theory as being more about figuring out what the outer objective should look like (broad principles that are then localised to specific contexts) rather than about figuring out how to ensure an AI internalises specific values. And I can see potential advantages in this kind of indirect approach vs. trying to directly define or learn a universal objective.
This is one of those things that sounds nice on the surface, but where it's important to dive deeper and really probe to see if it holds up.
The real question for me seems to be whether organic alignment will lead to agents deeply adopting co-operative values rather than merely instrumentally adopting them. Well, actually it's a comparison between how deep organic alignment is vs. how deep traditional alignment is. And it's not at all clear to me why they think their approach is likely to lead to a deeper alignment.
I have two (extremely speculative) guesses...
I basically agree with this, though I'd perhaps avoid virtue ethics. But yes, one of the main things I'd generally like to see is more LWers treating stuff like saving the world with the attitude you'd have in a job, perhaps at a startup or in a government body like America's Senate or House of Representatives, rather than viewing it as your heroic responsibility.
This is the right decision for most folk, but I expect the issue is more the opposite: we don't have enough folks treating this as their heroic responsibility.
I think both approaches have advantages.
The problem is that the Swiss cheese model and legislative efforts primarily just buy us time. We still need to be making progress towards a solution and whilst it's good for some folk to bet on us duct-taping our way through, I think we also want some folk attempting to work on things that are more principled.
Yeah, but how do you know that no one managed to sneak one past both you and the commenters?
Also, there's an art to this.
This seems to exist now.
Also, I did not realise that collapsible sections were a thing on Less Wrong. They seem really useful. I would like to see these promoted more.
They were in a kind of janky, half-finished state before (only usable in posts, not in comments, and only insertable from an icon in the toolbar rather than via the <details> syntax); writing this policy reminded us to polish it up.
I'd love to see occasional experiments where either completely LLM-generated or lightly edited LLM content is submitted to Less Wrong to see how people respond (with this fact being revealed afterwards). It would degrade the site if this happened too often, but I think it would make sense for moderators to occasionally grant permission for this.
I tried an experiment with Wittgenstein's Language Games and the Critique of the Natural Abstraction Hypothesis back in March 2023 and it actually received (some) upvotes. I wonder how this would go with modern LLMs, though ...
We get easily like 4-5 LLM-written post submissions a day these days. They are very evidently much worse than the non-LLM written submissions. We sometimes fail to catch one, and then people complain: https://www.lesswrong.com/posts/PHJ5NGKQwmAPEioZB/the-unearned-privilege-we-rarely-discuss-cognitive?commentId=tnFoenHqjGQw28FdY
However, if you merely explain these constraints to the chat models, they'll only follow your instructions sporadically.
I wonder if a custom fine-tuned model could get around this. Did you try few-shot prompting (i.e. examples, not just a description)?
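For clarity, here's a minimal sketch of what I mean by few-shot prompting: the compliant examples are supplied as prior conversation turns rather than merely described. This assumes the OpenAI Python client; the model name, system prompt, and example texts are placeholders I've made up.

```python
# Minimal few-shot prompting sketch: show the constraints via examples, not just a description.
# Assumes the OpenAI Python client is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": "Follow the style constraints demonstrated in the examples below."},
    # Few-shot examples: each user/assistant pair shows an input and a compliant output.
    {"role": "user", "content": "Draft a short comment about topic A."},
    {"role": "assistant", "content": "<an example comment that satisfies the constraints>"},
    {"role": "user", "content": "Draft a short comment about topic B."},
    {"role": "assistant", "content": "<another compliant example>"},
    # The actual request comes last, so the model imitates the pattern above.
    {"role": "user", "content": "Draft a short comment about topic C."},
]

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)
```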
I've written up a short-form argument for focusing on Wise AI advisors. I'll note that my perspective is different from that taken in the paper: I'm primarily interested in AI as advisors, whilst the authors focus more on AI acting directly in the world.
Wisdom here is an aid to fulfilling your values, not a definition of those values
I agree that this doesn't provide a definition of these values. Wise AI advisors could be helpful for figuring out your values, much like how a wise human would be helpful for this.
Other examples include buying poor-quality food and then having to pay for medical care, buying a cheap car that costs more in repairs, payday loans, etc.
Unless you insist that this system is helpful to those holding power and privilege, such as a king, as a reference point for public opinion; would that be legitimate?
That would make the domain of checkable tasks rather small.
That said, it may not matter depending on the capability you want to measure.
If you want the AI to hack a computer to turn the entire screen green, and it skips a pixel so as to avoid completing the task, it has still demonstrated that it possesses the dangerous capability, so it has no reason to sandbag.
On the other hand, if you are trying to see whether it has a capability that you wish it to use, it can still sandbag.
I'd strongly recommend spending some time in the Bay Area (or London as a second-best option). Spending time there will help you build your model of the space.
You may also find this document I created on AI Safety & Entrepreneurship useful.
One of the biggest challenges here is that subsidies designed to support alignment could be snagged by AI companies misrepresenting capabilities work as safety work. Do you think the government has the ability to differentiate between these?
Become a member of LessWrong or the AI Alignment Forum
I think the goal is for the alignment forum to be somewhat selective in terms of who can comment.
(Removed some of my comments b/c I just noticed the clarification that you meant the average member of the EA Forum/Less Wrong. I would suggest changing the title of your post, though).
For the record, I see the new field of "economics of transformative AI" as overrated.
Economics has some useful frames, but it also tilts people towards being too "normy" on the impacts of AI and it doesn't have a very good track record on advanced AI so far.
I'd much rather see multidisciplinary programs/conferences/research projects, including economics as just one of the perspectives represented, than economics of transformative AI qua economics of transformative AI.
(I'd be more enthusiastic about building economics of transformative AI as a field if we w...
Points for creativity, though I'm still somewhat skeptical about the viability of this strategy.
My intuition would be that models learn to implement more general templates as well.
It seems to me that "vibe checks" for how smart a model feels are easily gameable by making it have a better personality.
It's not clear to me that personality is completely separate from capabilities, especially with inference time reasoning.
Also, what do you mean by "bigger templates"?
I wonder about the extent to which having an additional level of selection helps.
High school curricula are generally limited by needing to be teachable by a large number of teachers all around the country and by needing a minimum number of students at each school who can handle the content.
If the préparatoires can put more qualified teachers and students together, that would allow significant development, and running selection for elite universities after such an intermediate preparatory program would reduce the chance that talented students are...
Here's a short-form with my Wise AI advisors research direction: https://www.lesswrong.com/posts/SbAofYCgKkaXReDy4/chris_leong-s-shortform?view=postCommentsNew&postId=SbAofYCgKkaXReDy4&commentId=Zcg9idTyY5rKMtYwo
(I already posted this on the Less Wrong post).
I was taking it as "solves" or "gets pretty close to solving". Maybe that's a misinterpretation on my part. What did you mean here?
First of all, it tackles one of the main core difficulties of AI safety in a fairly direct way — namely, the difficulty of how to specify what we want AI systems to do (aka “outer alignment”)
I wouldn't quite go so far as to say it "tackles" the problem of outer alignment, but it does tie into (pragmatic) attempts to solve the problem by identifying the ontology of realistically specifiable reward functions. However, maybe I'm misunderstanding you?
I suspect that your post probably isn't going to be very legible to the majority of folks on Less Wrong, since you're assuming familiarity with meta-modernism. To be honest, I suspect this post would have been more persuasive if you had avoided mentioning it, since the majority of folks here are likely skeptical of it and it hardly seems essential for making what seems to be the core point of your post[1]. Sometimes less is more. Things cut out can always be explored in the future, when you have the time to explain them in a way that will be legible to you...
Interesting idea. Will be interesting to see if this works out.
Lenses are... tabs. Opinionated tabs
Could you explain the intended use further?
The central problem of any wiki system is [1]"what edits do you accept to a wiki page?". The lenses system is trying to provide a better answer to that question.
My default experience on e.g. Wikipedia when I am on pages where I am highly familiar with the domain is "man, I could write a much better page". But writing a whole better page is a lot of effort, and the default consequence of rewriting the page is that the editor who wrote the previous page advocates for your edits to be reverted, because they are attached to their version of the page. ...
My take: Counterfactuals are Confusing because of an Ontological Shift:
"In our naive ontology, when we are faced with a decision, we conceive of ourselves as having free will in the sense of there being multiple choices that we could actually take. These choices are conceived of as actual and we when think about the notion of the "best possible choice" we see ourselves as comparing actual possible ways that the world could be. However, we when start investigating the nature of the universe, we realise that it is essentially deterministic and hence that our...
Well, we're going to be training AI anyway. If we're just training capabilities, but not wisdom, I think things are unlikely to go well. More thoughts on this here.
I believe that Anthropic should be investigating artificial wisdom:
I've summarised a paper arguing for the importance of artificial wisdom, with Yoshua Bengio as one of the authors.
I also have a short-form arguing for training wise AI advisors and an outline Some Preliminary Notes of the Promise of a Wisdom Explosion.
By Wise AI Advisors, I mean training an AI to provide wise advice. BTW, I've now added a link to a short-form post in my original comment where I detail the argument for wise AI advisors further.
Props for proposing a new and potentially fruitful framing.
I would like to propose training Wise AI Advisors as something that could potentially meet your two criteria:
• Even if AI is pretty much positive, wise AI advisors would allow us to get closer to maximising these benefits
• We can likely save the world if we make sufficiently wise decisions[1]
There's also a chance that we're past the point of no return, but if that's the case, we're screwed no matter what we do. Okay, it's slightly more complicated because there's a chance that we aren't yet past the
Why the focus on wise AI advisors?[1]
I'll be writing up a proper post to explain why I've pivoted towards this, but it will still take some time to produce a high-quality post, so I decided it was worthwhile releasing a short-form description in the meantime.
By Wise AI Advisors, I mean training an AI to provide wise advice.
a) AI will have a massive impact on society given the infinite ways to deploy such a general technology
b) There are lots of ways this could go well and lots of ways that this could go extremely poorly (election interference, cyber attac...
Thanks, this seems pretty good on a quick skim. I'm a bit less certain about the corrigibility section, and more issues might become apparent if I read through it more slowly.
How about "Please summarise Eliezer Yudkowsky's views on decision theory and its relevance to the alignment problem".
It did OK at control.
Nice article, I especially love the diagrams!
In Human Researcher Obsolescence you note that we can't completely hand over research unless we manage to produce agents that are at least as "wise" as the human developers.
I agree with this, though I would love to see a future version of this plan include an expanded analysis of the role that wise AI would play in the strategy of Magma, as I believe that this could be a key aspect of making this plan work.
In particular:
• We likely want to be developing wise AI advisors to advise us during the pre-hand-off...
Fellowships are typically only for a few months, and even if you're in India, you'd likely have to move for the fellowship unless it happened to be in your exact city.
Impact Academy was doing this, before they pivoted towards the Global AI Safety Fellowship. It's unclear whether any further fellowships should be in India or a country that is particularly generous with its visas.
I posted this comment on Jan's blog post:
Underelicitation assumes a "maximum elicitation" rather than a never-ending series of more and more layers of elicitation that could be discovered. You've undoubtedly spent much more time thinking about this than I have, but I'm worried that attempts to maximise elicitation merely accelerate capabilities without actually substantially boosting safety.
In terms of infrastructure, it would be really cool to have a website collecting the more legible alignment research (papers, releases from major labs or non-profits).
I think I saw someone arguing that their particular capability benchmark was good for evaluating the capability, but of limited use for training the capability because their task only covered a small fraction of that domain.
(Disclaimer: I previously interned at Non-Linear)
Different formats allow different levels of nuance. Memes aren't essays and they shouldn't try to be.
I personally think these memes are fine and that outreach is too. Maybe these posts oversimplify things a bit too much for you, but I expect that the average person on these subs probably improves the level of their thinking from seeing these memes.
If, for example, you think r/EffectiveAltruism should ban memes, then I recommend talking to the mods.
This seems to underrate the value of distribution. I suspect another factor to take into account is the degree of audience overlap. Like there's a lot of value in booking a guest who has been on a bunch of podcasts, so long as your particular audience isn't likely to have been exposed to them.