I'm glad you found it useful, even in this form. If the thing you're working on is something you could share, I'd be happy to offer further assistance, if you like.
Obviously this can't be done justice in a single comment, but here are some broad pointers that might help you see the shape of the solution:
I appreciate the nudge here to put some of this into action. I hear alarm bells when thinking about formalizing a centralized location for AI safety proposals and information about how they break, but my rough intuition is that if there is a way these can be scrubbed of descriptions of capabilities which could be used irresponsibly to bootstrap AGI, then this is a net positive. At the very least, we should be scrambling to discuss safety controls for already public ML paradigms, in case any of these are just one key insight or a few teraflops away from being world-ending.
I would like to hear from others about this topic, though; I'm very wary of being at fault for accelerating the doom of humanity.
My project seems to have expired from the OWASP site, but here is an interactive version that should have most of the data:
https://periodictable.github.io/
You'll need to mouse over the elements to see the details, so it's not really mobile-friendly, sorry.
I agree that linters are a weak form of automatic verification that is nonetheless quite valuable. You can get a lot of mileage out of simply blacklisting unsafe APIs, and a little more out of clever pattern matching.
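As a minimal sketch of the blacklisting idea (the denylist and helper names here are hypothetical, not any particular linter's API):

```python
import ast

# Hypothetical denylist; real linters ship curated lists of unsafe APIs.
BANNED_CALLS = {"eval", "exec", "pickle.loads", "os.system"}

def qualified_name(node):
    """Best-effort dotted name for a call target, e.g. 'os.system'."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = qualified_name(node.value)
        return f"{base}.{node.attr}" if base else None
    return None

def find_banned_calls(source):
    """Yield (lineno, name) for every call to a denylisted API."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = qualified_name(node.func)
            if name in BANNED_CALLS:
                yield node.lineno, name

sample = "import os\nos.system('rm -rf /tmp/scratch')\n"
for lineno, name in find_banned_calls(sample):
    print(f"line {lineno}: call to banned API {name!r}")
```

Even something this naive catches a surprising fraction of real bugs; the cleverness is all in growing that list and keeping false positives down.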
I would say that some formal proofs are actually impossible, but would agree that software with many (or even all) of the security properties we want could actually have formal-proof guarantees. I could even see a path to many of these proofs today.
While the intent of my post was to draw parallel lessons from software security, I actually think alignment is an oblique or orthogonal problem in many ways. I could imagine timelines in which alignment gets 'solved' before software security. In fact, I think survival timelines might even require anyone who migh...
The halting problem only makes it impossible to write a program that can analyze a piece of code and then reliably say "this is secure" or "this is insecure".
It would be nice to be able to have this important impossible thing. :)
I think we are trying to say the same thing, though. Do you agree with this more concise assertion?
"It's not possible to make a high confidence checker system that can analyze an arbitrary specification, but it is probably possible (although very hard) to design systems that can be programmatically checked for the important qualities of alignment that we want, if such qualities can also be formally defined."
I would agree that some people figured this out faster than others, but the analogy is also instructional here: if even a small community like the infosec world has a hard time percolating information about failure modes and how to address them, we should expect the average ML engineer to be doing very unsafe things for a very long time by default.
To dive deeper into the XSS example, I think even among those that understood the output encoding and canonicalization solutions early, it still took a while to formalize the definition of an encoding context con...
I think you make good points generally about status motives and obstacles for breakers. As counterpoints, I would offer:
Many! Thanks for sharing. This could easily turn into its own post.
In general, I think this is a great idea. I'm somewhat skeptical that this format would generate deep insights; in my experience successful Capture the Flag / wargames / tabletop exercises work best in the form where each group spends a lot of time preparing for their particular role, but opsec wargames are usually easier to score, so the judge role makes less sense there. That said, in the alignment world I'm generally supportive of trying as many different approaches as possible to see wh...
Thanks for the reply!
As some background on my thinking here, last I checked there are a lot of people on the periphery of the alignment community who have some proposal or another they're working on, and they've generally found it really difficult to get quality critical feedback. (This is based on an email I remember reading from a community organizer a year or two ago saying "there is a desperate need for critical feedback".)
I'd put myself in this category as well -- I used to write a lot of posts and especially comments here on LW summarizing how I'd g...
I definitely wouldn't rule out the possibility of being able to formally define a set of tests that would satisfy our demands for alignment. The most I could say with certainty is that it's a lot harder than eliminating software security bug classes. But I also wouldn't rule out the possibility that an optimizing process of arbitrarily strong capability simply could not be aligned, at least to a level of assurance that a human could comprehend.
Thank you for these additional references; I was trying to anchor this article with some very high-level concepts....
I got part of the way through the process and then got stuck, but my situation may not be typical.
I shudder to imagine the mutual funds created to fund bids on this thing.
How hard do you have to squint to not see this thing as pyramid-shaped? This thing is like Sierpinski's pyramid. It's fractally a scam; a scam at every conceivable resolution.
Actually, the worst case would be if the minting price increases at a slower rate than the value of half the pool grows. Then every next bid would still be "in the money", and whoever doesn't go bankrupt first wins. This thing could eat the whole world. Terrible. Kill it with fire.
Well, maybe I'm missing something, but the game theory doesn't seem that interesting to me. And calling it a 'return on investment' seems a bit generous for what is really just a game of blockchain chicken. In fact, it might turn out as crazy as a dollar auction, with people bidding more than what half the accumulated contract is worth due to sunk-cost fallacy or other irrational behavior.
Either way, you're not really buying anything of value here: you're just betting that the auction gets so little attention that you can walk away with free money, ...
What happens to the other half? This seems underspecified as you've described it.
In the interest of science, I ran 10 more simulations with our submitted population. This is not to open a can of worms or to challenge the results in any way - we all knew we had to win on the first try!
https://drive.google.com/file/d/1mSqaNlo5KT9l9vmY3ckd8KSTXA0xOz0u/view
Some things that I observed:
Here are our Brier scores for our predictions:
https://docs.google.com/spreadsheets/d/1qhuACrtD0esgCqz8rQvYcZOC0I1y1l66/edit#gid=225287990
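For anyone unfamiliar, and assuming the standard binary definition is what we computed, the Brier score is just the mean squared error of the probability forecasts; lower is better:

```latex
\mathrm{BS} = \frac{1}{N}\sum_{t=1}^{N} (f_t - o_t)^2,
\qquad f_t \in [0,1] \text{ (forecast)},\ o_t \in \{0,1\} \text{ (outcome)}
```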
The defenseless creature result really surprised most of us. Well done, aphyer, you knew what was up.
Of all the things, the coconuts were by far the most difficult to get anything to survive on. In my simulations, usually the coconut eaters that survived were also eating something else.
In theory, coconuts should sustain a 13.1 E creature; in practice, with such a small food source, a creature this size gets outcompeted at first by much smaller organisms, which then get hunted to extinction by predators.
Ah, I read the wrong line. So yeah, we submitted the exact same creature.
There were definitely reliably BAD creatures, and certainly some reliably good ones, but a lot of variance based on the overall makeup of the population. I certainly didn't expect so many total creatures to be submitted; there was a lot more variability in results with 500-creature populations. In 5000-creature populations, basically the only thing that ever survived was invincibles.
With this size population, I don't think it's a coincidence that your minimal invincible survived - and certainly wasn't just luck that you arrived at its design. Give yourself SOME credit. :)
I submitted the exact same 10 speed leaf eater that you did, I just started it in the Temperate Forest. Luck of the draw that yours got here first, I guess.
Damn, now I'm upset I didn't spend more time thinking of a good name. A brown bear isn't even a pure predator! Really wish I had called THIS one the Trash Panda, instead. :)
Wait, are you initializing and running each biome separately? I expected all biomes to be seeded at once with the complete set of submitted organisms.
My definition of "minimal invincibles" here:
0 ATK, 10 DEF, 1 SPD, Antivenom herbivore
OR
0 ATK, 0 DEF, 10 SPD herbivore
These definitely win in a field of hundreds of participants. In my simulations, they were outcompeted by "less" invincible creatures fitting the invincible prototypes with 20-50 participants (200-500 creatures). I hedged my bets with a few invincibles, some hard-to-kills, and some things I found surprisingly hard to kill.
Also, my daughter's creature, so she has a chance to embarrass us all. :)
Did anyone find a way to reliably crash the populations of non-invincibles with fewer than 200 creatures (a reasonable amount of confederates you could wrangle)?
Embarrassing story:
I spent a lot of time writing a fast simulator and testing all kinds of approaches. Today I let my daughter (8), who doesn't really understand the game mechanics, design a species...and it performed better than every other creature on the first try. Granted, I had to help her correct some obviously suboptimal choices, but still...let's just say my confidence is not high.
I'll precommit to suggesting a secondary scoring mechanism for bragging rights: not simply the highest total number of surviving organisms but the total energy of the organisms (population * base energy).
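A rough sketch of that scoring, with hypothetical field names (the real simulator's data structures surely differ):

```python
def total_energy_score(survivors):
    """Secondary score: sum over species of population * base energy."""
    return sum(s["population"] * s["base_energy"] for s in survivors)

# Made-up final populations, just to show the scoring:
survivors = [
    {"species": "leaf eater", "population": 120, "base_energy": 10},
    {"species": "coconut eater", "population": 8, "base_energy": 13.1},
]
print(total_energy_score(survivors))  # 1304.8
```

This rewards a few big survivors as much as swarms of tiny ones.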
Good luck everyone!
Can you give a more specific deadline? What timezone?
It would also be kind of a pain in the ass to change! :)
Not what I'm seeing. Roamers start roaming before the encounters in each biome, then after every biome is processed, the roamers find a new home. So the roamers go a whole generation without competing or foraging. Is that not what was intended?
I thought the same thing at first, but I think if the interact method is called with only one argument, then that creature ends up foraging normally. Since spawning depends on creature size and reproduction depends on energy, each biome seems equally likely to end a generation with an odd number of creatures as with an even number, so this situation would happen whether or not roaming occurs.
The tough situation is for carnivores; if they're the odd one out, they'll die, even if there are species that they could eat.
There is no initial check to see if a species can survive in its spawning biome. Obviously this doesn't matter for breathing, but species could live in the desert or tundra for free without the corresponding traits.
Ah, ok. So instead of competing in that generation, the individual roams.
If my understanding of the code is correct, if the organism successfully roams, it basically spawns another copy of itself, leaving the original behind to compete in the source biome. That organism isn't removed from the competition pool. Given the relatively low roaming rate, I'm not sure this makes a huge difference, but it doesn't seem like intended behavior.
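In pseudocode, the pattern I think I'm seeing (structure and names are my reconstruction, not the actual simulator code):

```python
# Hypothetical reconstruction of the roaming behavior described above.
def resolve_roaming(source_biome, dest_biome, creature, roll_succeeded):
    if roll_succeeded:
        # A copy of the roamer arrives in the destination biome...
        dest_biome.append(dict(creature))
        # ...but the original stays in the source biome's competition pool.
        # I would have expected something like:
        # source_biome.remove(creature)

source, dest = [{"species": "roamer", "energy": 5}], []
resolve_roaming(source, dest, source[0], roll_succeeded=True)
print(len(source), len(dest))  # 1 1 -> the creature now exists in both biomes
```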
Can you elaborate on the winning condition? I expect most biomes will have surviving species; will that mean multiple winners, or will the ultimate winner be the species with the most total biomass? How long will the simulation be run? I can imagine stable equilibrium conditions with multiple survivors, even after an arbitrarily large number of simulation rounds.
Spelling: *dEtritus
Reading this reply, I was immediately reminded of a situation described by Jen Peeples, I think in an episode of The Atheist Experience, about her co-pilot's reaction of prayer during a life-threatening helicopter incident. (This Comment is all I could find as a reference.)
Unless your particular prayer technique is useful for quickly addressing emergency situations, you probably don't want to be in the habit of relying on it as a general practice. I think the "rubber duck" Socratic approach could still be useful, so this isn't a disagreement with...
Isn't there a separate axis for every aspect of human divergence? Maybe this was already explicit in asking if there is anything more complicated than romance for "multiplayer" relationships, but really this problem seems fully general: politics, or religion, or food, or any other preference that has a distribution among humans could be a candidate for creating schism (or indeed all axes at once). "Catgirl for romance" is one very specific failure mode, but the general one could be called "an echo chamber for every mind".
The e...
It was hard to muster a proper sense of indignation when you were confronting the same dignified witch who, twelve years and four months earlier, had given both of you two weeks' detention after catching you in the act of conceiving Tracey.
Given that there is a Tracey, that act of conception must have been completed. So either McGonagall caught them at exactly the right moment, or the Davises just kept on going after they were caught...
No matter how it happened, this scene must have played out hilariously.
If consequentialism and deontology shared a common set of performance metrics, they would not be different value systems in the first place.
At least one performance metric that allows for the two systems to be different is: "How difficult is the value system for humans to implement?"
[edited out emotional commentary/snark]
I think what you mean to tell me is: "say 'my preferences' instead of 'my utility function'". I acknowledge that I was incorrectly using these interchangeably.
I do think it was clear what I meant when I called it "my" function and talked about it not conforming to VNM rules, so this response felt tautological to me.
I notice we're not understanding each other, but I don't know why. Let's step back a bit. What problem is "radiation poisoning for looking at magnitude of utility" supposed to be solving?
We're not talking about adding N to both sides of a comparison. We're talking about taking a relation where we are only allowed to know that A < B, multiplying B by some probability factor, and then trying to make some judgment about the new relationship between A and xB. The rule against looking at magnitudes prevents that. So we can't give an answer to the q...
It's too late for me. It might work to tell the average person to use "awesomeness" as their black box for moral reasoning as long as they never ever look inside it. Unfortunately, all of us have now looked, and so whatever value it had as a black box has disappeared.
You can't tell me now to go back and revert to my original version of awesome unless you have a supply of blue pills whenever I need them.
If the power of this tool evaporates as soon as you start investigating it, that strikes me as a rather strong point of evidence against it. It was fun while it lasted, though.
Oops, you tried to feel a utility. Go directly to type theory hell; do not pass go, do not collect 200 utils.
I don't think this example is evidence against trying to 'feel' a utility. You didn't account for scope insensitivity and the qualitative difference between the two things you think you're comparing.
You need to compare the feeling of the turtle thrown against the wall to the cumulative feeling when you think about EACH individual beheading, shooting, orphaned child, open grave, and every other atrocity of the genocide. Thinking about the vague concept "genocide" doesn't use the same part of your brain as thinking about the turtle incident.
What I mean by "normalized" is that you're compressing the utility values into the range between 0 and 1. I am not aware of another definition that would apply here.
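Concretely, the compression I have in mind (assuming the decision's worst and best outcomes anchor the scale):

```latex
u'(x) = \frac{u(x) - u_{\min}}{u_{\max} - u_{\min}},
\qquad u_{\min} = \min_x u(x),\quad u_{\max} = \max_x u(x)
```

so every u'(x) lands in [0, 1].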
Your rule says you're allowed to compare, but your other rule says you're not allowed to compare by magnitude. You were serious enough about this second rule to equate it with radiation death.
You can't apply probabilities to utilities and be left with anything meaningful unless you're allowed to compare by magnitude. This is a fatal contradiction in your thesis. Using your own example...
No, I mean that if my utility function violates transitivity or other VNM axioms, I'd rather fix it than throw out VNM as invalid.
I think I have updated slightly in the direction of requiring my utility function to conform to VNM and away from being inclined to throw it out if my preferences aren't consistent. This is probably mostly due to smart people being asked to give an example of a circular preference and my not finding any answer compelling.
Expectation. VNM isn't really useful without uncertainty. Without uncertainty, transitive preferences are enough.
I think I see the point you're trying to make, which is that we want to have a normalized scale of utility to apply probab...
That was one of the major points. Do not play with naked utilities. For any decision, find the 0 anchor and the 1 anchor, and rank other stuff relative to them.
I understood your major point about the radioactivity of the single real number for each utility, but I got confused by what you intended the process to look like with your hell example. I think you need to be a little more explicit about your algorithm when you say "find the 0 anchor and the 1 anchor". I defaulted to a generic idea of moral intuition about best and worst, then only mad...
"Awesomeness" is IMO the simplest effective pointer to morality that we currently have, but that morality is still inconsistent and dynamic.
The more I think about "awesomeness" as a proxy for moral reasoning, the less awesome it becomes and the more like the original painful exercise of rationality it looks.
I've been very entertained by this framing of the problem - very fun to read!
I find it strange that you claim the date with Satan is clearly the best option, but almost in the same breath say that the utility of whaling in the lake of fire is only 0.1% worse. It sounds like your definition of clarity is a little bit different from mine.
On the Satan date, souls are tortured, steered toward destruction, and tossed in a lake of fire. You are indifferent to those outcomes because they would have happened anyway (we can grant this as a premise of the scenario). Bu...
Edited, thanks for the style correction.
I suspect you're probably right that more examples would make this more interesting, given the lack of upvotes. In fact, I probably found the quote relevant mostly because it more or less summed up the experience of my OWN life at the time I read it years ago.
I spent much of my youth being contrarian for contradiction's sake, and thinking myself to be revolutionary or somehow different from those who just joined the cliques and conformed, or blindly followed their parents, or any other authority.
When I realized that defin...
This is a great draft and you have collated many core ideas. Thank you for doing this!
As a matter of practical implementation, I think it's a good idea to always have a draft of official, approved statements of capabilities that can be rehearsed by any individual who may find themselves in a situation where they need to discuss them. These statements can be thoroughly vetted for second- and higher-order information leakage ahead of time, instead of trying to evaluate in real-time what their statements might reveal. It can be counterproductive in many circu...