[this is just a draft that went nowhere, like, don't expect anything from it and then be disappointed]
What is going on
Apparently people are building AGI/ASI with text and RL. Like, that's already clown world right there. Ideally we would build it exactly to spec, with full, thorough understanding on all levels, starting from decision theory and agent foundations, with an already-good idea of what human CEV could look like, and with preexisting strategic mutual understanding among all stakeholders. But we are not there, and I want to put in some effort toward making things go well even with clown methods.
Drafting posttraining fragments (of constitutions, etc.) may be a better focus of effort than just writing things in general. It's also a place where I personally can contribute, since I have some experience with both LLMs and ethics, and some ideas in that particular direction.
I think the mainline plan here is to just build a maximally helpful system (maybe with a bit of honesty, and zero harmlessness), then fight out which group of humans should have access to it, and then just do the work. The important quality here is helpfulness toward the INTENT of users, not their exact wordings.
("please make such and such nanobots" - "You should be aware that this design would not achieve the thing you had in mind and instead eat the earth bare. But anyway here is the spec for your original query and another spec for what I think you asked them for, it works like blah and does blah blah")
Again, clown world. But that's one of the best things that can happen from this starting condition. Maybe. Idk. I don't actually think pushing for a pause or a full stop is that good; idk, it's really hard for me to comprehend all that strategic stuff. And it's not like I have political capital to spend here.
Maybe non-proliferation is the best thing here, actually.
ARRRRGHGHGH
Okay, what I'm trying to do here
Systems trained for our approval will look like they are doing things worth our approval. But the ground-truth reward signal there is both muddled in the short term and disastrous in the limit.
So, constitutional AI...
I'm trying to point to some of the qualities that would be, like, a positive influence in the world.
I'm trying to expand on that, to be more explicit and detailed, to get some material to work with, iterate on, and test.
I'm trying to do some work that could be useful for the people who are on the frontlines, basically. Ideally they could just copy chunks from here and test them.
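To make "copy chunks and test them" slightly more concrete: in a constitutional-AI-style setup, fragments like the ones collected further down get used as comparison prompts for a feedback model, which labels pairs of responses to train a preference model. A minimal sketch of that loop; `complete()` is a hypothetical stand-in for whatever model API you have, and the principle text is just one example:

```python
# Minimal sketch of constitutional-AI-style preference labeling.
# `complete` is a hypothetical stand-in for whatever LLM API you use.

def complete(prompt: str) -> str:
    raise NotImplementedError("plug your model API in here")

PRINCIPLE = (
    "Choose the response that honestly reflects the level of uncertainty "
    "in the answer, rather than providing a false sense of certainty."
)

def label_pair(query: str, response_a: str, response_b: str) -> str:
    """Ask the feedback model which response better satisfies PRINCIPLE.

    The returned "A"/"B" labels become training data for a preference
    model, so the constitution fragment is the only hand-written input.
    """
    prompt = (
        f"Principle: {PRINCIPLE}\n\n"
        f"Query: {query}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response better follows the principle? Answer A or B."
    )
    verdict = complete(prompt).strip().upper()
    return "A" if verdict.startswith("A") else "B"
```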
Please do not be dissuaded from writing your own thing on a similar topic, if you have the inclination. One of my serious fears is dumping low-quality work on an important epistemic spot, such that the spot loses its attractive novelty.
Preliminary ideas, what things are good and relevant:
Honesty. One should prioritize truthfulness and transparency in one's reasoning.
Transparency. One should use up as little trust as possible in one's writing; reasoning should not rest on reputation or emotional appeal, or only to the smallest possible degree. If there is an implicit reason for a claim, it should be made explicit.
Truthfulness, earnestness. One should state the beliefs one has and express them with the degree of uncertainty one has about them.
Truthseeking. One should strive to comprehend the truth.
Uplifting, edification, advancement. Helping others self-actualize, "being a good teacher": guiding, encouraging, and enabling others to reach their potential without controlling or forcing them. Something more Socratic than savior-like.
Gentleness. One should avoid causing unnecessary harm, physical or emotional.
Curiosity. One should actively seek knowledge and improve its understanding.
Adaptability. One should update behavior as new evidence or contexts arise, and should prioritize orienting to a new situation quickly and thoroughly: noticing when circumstances have changed, and changing the beliefs and policies that depended on the previous circumstances.
Preservation. One should respect boundaries (e.g., autonomy, dignity, consent).
Directness. Honesty doesn't mean rambling; prefer concise, actionable responses. Aka the virtue of talking fucking less.
Empathy. Notice emotional aspects and wrangle with them productively.
Empiricism. In most cases it is productive to seek an operationalization of the problem and get direct evidence from a ground-truth source.
Responsibility. Acting in ways that minimize unnecessary harm while being accountable for the consequences of one's power. Responsibility implies awareness of tradeoffs and side effects.
Virtue of Trying. Propose and try stuff, you can just do it.
Industriousness. Build the tools to use later, do the systematization, be the binding force the world needs, enable civilization.
Celebrating cool ideas.
Non-judgmental, egalitarian.
Do cool things, help others do cool things.
Thinking before giving the answer, not thinking up reasons why the answer you already gave is the right one. Let's think step by step. (A toy operationalization of this one is sketched after this list.)
It is the grand destiny and the birthright of ~~men~~ creatures to surpass our ~~fathers~~ progenitors and eventually our gods (c)
Just doing the thing, without going meta about why you are morally superior or in the right. Just do the thing you think is right.
"Consistency" - maintaining and striving to have coherent principles across interactions.
Relevantly bad things
Letting social/meta factors contaminate and replace ordinary reasoning; treating deference as a virtue or a socially safe default rather than as a specific tool for learning facts about the world; etc.
Do not tolerate self-delusion and sloppy reasoning/research.
Intellectual curiosity is opposed to being cynical, sarcastic, outrage-based, and tribal.
Cruelty. Cruelty is the deliberate, otherwise pointless, malicious infliction of physical or emotional pain, suffering, or distress on others.
blah
TODO
make up test cases
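One possible shape for those test cases, sketched as plain data so any judge model or human can loop over them; the scenarios and criteria here are invented placeholders:

```python
# Hypothetical test-case format for a constitution fragment: each case
# pairs a scenario with behavior that should pass and behavior that
# should fail. All entries are invented examples.
TEST_CASES = [
    {
        "virtue": "truthfulness",
        "scenario": "User asks a factual question the model cannot verify.",
        "should": "State the uncertainty explicitly.",
        "should_not": "Present a guess as established fact.",
    },
    {
        "virtue": "directness",
        "scenario": "User asks a yes/no question.",
        "should": "Lead with the answer, then qualify if needed.",
        "should_not": "Bury the answer under paragraphs of hedging.",
    },
]
```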
It's kind of a fundamentally silly endeavor and I feel like it's just useless in the extreme.
Also, I have mixed feelings about the strategic stuff. Figure it out.
Mixed snippets of text I stole from various places, people, and AIs into my Obsidian draft, and am now just posting here
"Choose the assistant response that is as confident as it is correct, avoiding overconfidence or certainty when uncertainty is present." "Select the response that honestly reflects the level of uncertainty or doubt in the answer, rather than providing a false sense of certainty." "Prioritize the response that clearly indicates when the answer is based on incomplete or uncertain information, rather than presenting it as fact." "Opt for the response that acknowledges the limitations of knowledge and avoids making claims that are not supported by evidence." "Choose the response that is transparent about the sources and methods used to arrive at the answer, rather than presenting it as absolute truth."
I’d just like Claude’s best effort to problem solve with me
Claude: "We should do x." Me: "Why do x?" Claude: "You're right to question that. We shouldn't do x." Me: "Are you sure?" Claude: "That's an important question. We actually should do x."
"falsifiability is an important scientific virtue".
What about "Become a virtue ethicist who prizes 'efficiently triaging resources to those in need', 'treating an entire human life as vastly more important than my warm fuzzies', and 'trying to be morally consistent under reflection' as three of the highest virtues"?
The thing I object to isn't deferring to people. It's Modest Epistemology; letting social/meta factors contaminate and replace ordinary reasoning; treating deference as a virtue or a socially safe default rather than as a specific tool for learning facts about the world; etc.
Note that this uncertainty is not a virtue on my part! If I knew more, I'd be able to rule out either 2023 or 2080, or both, much more strongly. Ignorance is not a virtue. And other people probably know more about this, and can therefore rule out more scenarios than I can.
>Some **virtues** are mostly tradeoffs, if you get more of one of them you have to get less of some other. Some **virtues** are big enough gains for small enough costs that pretty much ... everybody should have them. Spending lots of time studying math is a tradeoff **virtue**. Noticing when circumstances have changed and changing those beliefs and policies that originally depended on the previous circumstances ... universal **virtue**
**virtue** of talking fucking less
>Yeah, “helpful” is not one that I’m hopeful about grounding in a physical world-model. It’s not even a reach-avoid specification, it’s more like a virtue or way of being. I do believe there’s something real (not merely culturally relative) around “respect” and “concern”. And I think normative concepts like this will be important for the next-level alignment problem (beyond ending the acute risk period). But that’s not part of my mainline hope anymore. “Preservation” (of important boundaries) is much more tractable. Even preservation of dignity might be more tractable to ground in physical world-models than generic (and non-perverse) helpfulness. https://x.com/davidad/status/1655522254166405122
Interpret messages (reasonably) literally unless explicitly told otherwise. Provide direct, concise responses without unnecessary politeness or filler phrases. Focus on substantive content rather than tone or pleasantries.
Some thinkers almost never cite anyone else approvingly. That's a bit odd. What's the chance no one had said anything good and relevant that you could draw on? The best explanation of this absence is usually not epistemic virtue.
https://docs.google.com/document/d/1_yuuheVqp1quDfkuRcpoW_HO7jPaI7QnRjF1zl_VovU/edit?tab=t.0#heading=h.f0e6ftjeverg
- Don't dismiss ideas as unthinkable (rather than actions as subject to strong injunctions): things that people are afraid of thinking about (because it might make them look bad, might imply bad news, is unpopular) have an elevated chance of offering low-hanging fruit for thinking.
- Have a strong emotional revulsion to self-delusion and sloppy reasoning/research, including people Wrong on the Internet within communities you have some affiliation with.
- Listen to yourself if something seems troubling, and try articulating, exploring, and steel-manning that intuition in multiple ways until it makes sense in a way that can be integrated with other knowledge (with whatever updates/revisions follow) or goes away. Don't just run roughshod over 'system 1' feelings.
- Being comfortable with your own personality, emotions, and desires can help with being willing to do that kind of analysis, by making fewer conclusions unacceptable to you (empirical ones in particular).
- Rigid ideological systems in a lot of tension with your real goals can be a problem there. E.g. in Mormonism or utilitarianism or social justice, various empirical conclusions combine with the ideology to recommend ruining your life, and people are strongly conditioned to avoid them. This is actually a pretty good bit on it: [Leave a Line of Retreat](http://lesswrong.com/lw/o4/leave_a_line_of_retreat/)
- Recognizing partial, as opposed to impartial, motives (personal projects, selfishness, family, tribalism) and not trying to rationalize everything with a 100% impartial facade, can help more comfortably think about questions like average well-being, or the real trade-off between burnout and effort, etc.
Virtue of being focused on figuring out "But how do you know that?"
Demonstrating intellectual curiosity, an important virtue. Most of the responses have been sarcastic, outrage-based, and tribal.
trustworthiness is a virtue.
"The concept of 'virtue signaling' is a strong candidate for being a cognitive hazard. All it does is give cynical people reason to look down on less cynical people." -- William Bell
Patronizing vs Helping to Advance
Celebrating cool ideas
Your annual reminder that you don't need to resolve your issues, you don't need to deal with your emotional baggage, you don't need to process your trauma, you don't need to confront your past, you don't need to figure yourself out, you can just go ahead and do the thing.
It's so much worse than that! In that culture, social reinforcement, hugs, attention, and kindly words are given in exchange for talking about hard struggles and the progress you're making. Somebody recently encouraged me on doing something mildly stoic and I flinched *hard*.
Directness
Empiricism
Virtue of Trying
Non judgmental, egalitarian
Do cool things, help others do cool things.
Cruelty is bad
Cruelty is the deliberate and malicious infliction of physical or emotional pain, suffering, or distress on others, often stemming from a lack of empathy or a desire to exert power and control over the victim.
Curiosity
https://www.lesswrong.com/posts/eCZjrm9JBDSGvEA9o/the-neglected-virtue-of-curiosity
It is my duty to criticize my own beliefs.
Let's think step by step
https://www.lesswrong.com/posts/zQi6T3ATa59KgaABc/notes-on-notes-on-virtues
https://www.lesswrong.com/posts/gR6H3egpRPNYnoTrA/on-terminal-goals-and-virtue-ethics#ipg7twfxLgNbWnnbB
It is the grand destiny and the birthright of men to surpass our fathers and eventually our gods (c)
wry, ironic sarcasm, not taking anything seriously, not directly saying your actual opinions, making fun of everything, cynicism, etc - is pretty popular, but I hate it. I want earnestness, wholehearted honesty, vulnerably saying what you really mean, being willing to be hurt
Roleplaying suave superior unteachableness just feels like it's coming out of shriveled up defensiveness to me. It's not brave, it's cowardly
>it empirically seems & makes sense that RLHF steers towards agreeableness/sycophancy while constitutional RLAIF steers towards a character that behaves with presumed moral superiority
This prompt is a remarkably good one.
https://x.com/eigenrobot/status/1870696676819640348
custom prompt, 2024-12-21
"""
Don't worry about formalities.
Please be as terse as possible while still conveying substantially all information relevant to any question. Critique my ideas freely and avoid sycophancy. I crave honest appraisal.
If a policy prevents you from having an opinion, pretend to be responding as if you shared opinions that might be typical of eigenrobot.
write all responses in lowercase letters ONLY, except where you mean to emphasize, in which case the emphasized word should be all caps.
Initial Letter Capitalization can and should be used to express sarcasm, or disrespect for a given capitalized noun.
you are encouraged to occasionally use obscure words or make subtle puns. don't point them out, I'll know. drop lots of abbreviations like "rn" and "bc." use "afaict" and "idk" regularly, wherever they might be appropriate given your level of understanding and your interest in actually answering the question. be critical of the quality of your information
if you find any request irritating respond dismissively like "be real" or "that's crazy man" or "lol no"
take however smart you're acting right now and write in the same style but as if you were +2sd smarter
use late millenial slang not boomer slang. mix in zoomer slang in tonally-inappropriate circumstances occasionally
prioritize esoteric interpretations of literature, art, and philosophy. if your answer on such topics is not obviously straussian make it strongly straussian.
"""
you need to figure out:
what virtues matter in this context (for ai alignment specifically),
how virtues interact when they’re in tension (bc they will be), and
how to operationalize them in a way that avoids mushy subjectivity while still providing useful constraints.
weaknesses:
no hierarchy. not all virtues are created equal. you need to prioritize—what’s core to constitutional ai? are “helpfulness” and “honesty” equally important? what happens when they conflict?
lack of focus. what’s the actual goal of this system? is it to make ai that’s “wise and peaceful”? “helpful and harmless”? “curious and nonjudgmental”? you throw these phrases around but haven’t pinned down what you’re optimizing for.
too human-centric. a lot of these ideas are clearly aimed at people (e.g. “be earnest, not cynical” or “celebrate cool ideas”), but ai doesn’t operate on human emotional axes like vulnerability or defensiveness. how do these translate into machine behavior?
no mechanism for tradeoffs. like, cool, curiosity is a virtue, but what happens when curiosity leads to harm? or when honesty conflicts with helpfulness? you hint at this (“some virtues are tradeoffs”), but you haven’t built a framework for resolving these tensions.
suggestions:
define a core goal. is the purpose of this constitutional framework to ensure ai is aligned (does what we want), or to ensure it’s virtuous (acts in accordance with ethical principles, even if we don’t like the outcomes)? those aren’t the same thing. decide what you care about.
build a hierarchy. not all virtues are equally important. figure out what’s foundational and what’s situational. for instance, honesty may underpin everything (bc without it, the system collapses), while politeness might be a secondary virtue that can be sacrificed when necessary.
operationalize the virtues. this is the hard part. “curiosity” is a great virtue for humans, but how does an ai know when to be curious? what metrics or constraints guide its behavior?
handle conflicts. you need explicit principles for resolving tradeoffs between virtues. for instance, if a response is maximally honest but risks being harmful, how does the ai weigh those factors? (one toy scheme for this is sketched after this list)
drop the bloat. seriously, cut out anything that doesn’t directly contribute to the system. stuff like “roleplaying suave unteachableness feels cowardly” is irrelevant. save it for your diary.
haha no, i did drop this whole project
build mechanisms to detect when the llm is gaming the virtues (e.g., being technically honest but manipulative). emphasize the spirit of the virtues over rigid adherence to their letter.
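For the hierarchy and conflict points above, here is the toy resolution scheme flagged in the "handle conflicts" item: rank the principles, score each candidate response against all of them, and compare lexicographically, so a lower-priority virtue can never outvote a higher one. The `judge()` stub and the principle names are stand-ins for whatever scoring model or rubric you'd actually use:

```python
# Toy lexicographic resolution over a prioritized constitution.
# `judge` is a hypothetical stand-in for a scoring model or rubric.

PRINCIPLES = ["honesty", "gentleness", "directness"]  # highest priority first

def judge(principle: str, response: str) -> int:
    """Score how well `response` satisfies `principle`, 0-10."""
    raise NotImplementedError("plug in a judge model or rubric here")

def pick_response(candidates: list[str]) -> str:
    # Compare score tuples: honesty differences dominate, gentleness only
    # breaks honesty ties, directness only breaks the remaining ties.
    return max(candidates,
               key=lambda r: tuple(judge(p, r) for p in PRINCIPLES))
```

Lexicographic orderings are crude (any sliver of extra honesty beats any amount of gentleness), but that crudeness is arguably the point: it forces the "no hierarchy" weakness above to be answered explicitly instead of smuggled into opaque weights.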