NSFW text and images are not dangerous, and avoiding them is not relevant to the field of AI safety.
Sometimes a vague phrasing is not an inaccurate demarcation of a more precise concept, but an accurate demarcation of an imprecise concept.
Yeah. It's possible to give quite accurate definitions of some vague concepts, because the words used in such definitions also express vague concepts. E.g. "cygnet" - "a young swan".
I would say that if a concept is imprecise, more words [but good and precise words] have to be dedicated to faithfully representing the diffuse nature of the topic. If this larger faithful representation is compressed down to fewer words, that can lead to vague phrasing. I would therefore often view vague phrasing as a compression artefact, rather than a necessary outcome of translating certain types of concepts to words.
I'm confused about SAE feature descriptions. In Anthropic's and Google's demos both, there're a lot of descriptions that seem not to match a naked-eye reading of the top activations. (E.G. "Slurs targeting sexual orientation" also has a number of racial slurs in its top activations; the top activations for "Korean text, Chinese name yunfan, Unicode characters" are almost all the word "fused" in a metal-related context; etc.). I'm not sure if these short names are the automated Claude descriptions or if there are longer more accurate real descriptions somewhere; and if these are the automated descriptions, I'm not sure if there's some reason to think they're more accurate than they look, or if it doesn't matter if they're slightly off, or some third thing?
These are LLM-generated labels; there are no "real" labels (because they're expensive!). Especially in our demo, Neuronpedia made them with GPT-3.5, which is kinda dumb.
I mostly think they're much better than nothing, but shouldn't be trusted, and I'm glad our demo makes this apparent to people! I'm excited about work to improve autointerp, though unfortunately the easiest way is to use a better model, which gets expensive.
One can think of this as a case where auto-interp exhibits a precision-recall trade-off. At one extreme, you can generate super broad annotations like "all English text" to capture a lot, which would be overkill; and at the other end, you can generate very specific ones like "Slurs targeting sexual orientation", which would risk mislabeling, say, racial slurs.
Section 4.3 of the OpenAI SAE paper also discusses this point.
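To make the precision-recall framing above concrete, here is a minimal sketch on a made-up toy corpus. Everything in it is an assumption for illustration: the eight tokens, the `corpus` and `fires_on` names, and the idea that the feature fires on every slur. It does not describe how any real autointerp pipeline scores labels.

```python
def precision_recall(predicted, actual):
    """Precision and recall of a label's predicted tokens against the tokens the feature fires on."""
    predicted, actual = set(predicted), set(actual)
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical toy corpus: token id -> kind of text it appears in.
corpus = {
    0: "orientation_slur", 1: "orientation_slur",
    2: "racial_slur", 3: "racial_slur",
    4: "plain_english", 5: "plain_english",
    6: "plain_english", 7: "plain_english",
}

# Assumption: the feature actually fires on every slur, whatever its target.
fires_on = {i for i, kind in corpus.items() if kind.endswith("_slur")}

# A label is modeled as the set of tokens it claims the feature fires on.
broad_label = set(corpus)  # "all English text"
narrow_label = {i for i, kind in corpus.items() if kind == "orientation_slur"}

broad = precision_recall(broad_label, fires_on)    # (0.5, 1.0): misses nothing, half of it is noise
narrow = precision_recall(narrow_label, fires_on)  # (1.0, 0.5): all correct, misses the racial slurs
print(broad, narrow)
```

The broad label buys perfect recall at the cost of precision, and the narrow one does the reverse, which is the trade-off a short auto-generated name has to sit somewhere on.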
people sometimes talk about whether you should go into policy, ai research, ea direct work, etc. but afaict all of those fields work like normal careers where actually you have to spend several years resume-building before painstakingly convincing people you're worth hiring for paid work. so imo these are not actually high-leverage paths to impact and the fields are not in fact short on people.
Depends on what you mean by "resume building", but I don't think this is true for "need to do a bunch of AI safety work for free" or similar. i.e. for technical research, many people that have gone through MATS and then been hired at or founded their own safety orgs have no prior experience doing anything that looks like AI safety research, and some don't even have much in the way of ML backgrounds. Many people switch directly out of industry careers into doing e.g. ops or software work that isn't technical research. Policy might seem a bit trickier but I know several people who did not spend anything like years doing resume building before finding policy roles or starting their own policy orgs and getting funding. (Though I think policy might actually be the most "straightforward" to break into, since all you need to do to demonstrate competence is publish a sufficiently good written artifact; admittedly this is mostly for starting your own thing. If you want to get hired at a "larger" policy org resume building might matter more.)
Hiring being highly selective does not imply things aren't constrained on people.
Getting 10x as many people as good as the top 20 best safety researchers would make a huge difference.
This argument has been had before on LessWrong. Usually the counter here is that we don’t actually know ahead of time who the top 20 people are, and so need to experiment & would do well to hedge our bets, which is the main constraint to getting a top 20. Currently we do this but only really do it for 1-2 years, but historically it actually takes more like 5 years to reveal yourself as a top 20, and I’d guess it can actually take more like 10 years.
So why not that funding model? Mostly a money thing.
I expect you will argue that in fact revealing yourself as a top 20 happens in fewer than 5 years, if you do argue.
Hmm, I really just mean that "labor" is probably the most important input to the current production function. I don't want to make a claim that there aren't better ways of doing things.
Ok, but when we ask why this constraint is tight, the answer is because there's not enough funding. We can't just increase the size of the field 10x in order to get 10x more top-20 researchers, because we don't have the money for that.
For example, suppose MATS suddenly & magically scaled up 10x, and their next cohort was 1,000 people. Would this dramatically change the state of the field? I don't think so.
Now suppose SFF & LTFF's budget suddenly & magically scaled up 10x. Would this dramatically change the state of the field? I think so!
Now suppose SFF & LTFF's budget suddenly & magically scaled up 10x. Would this dramatically change the state of the field? I think so!
I do think so, especially if they also further increased/decentralized their grantmaking capacity, and perhaps increased the field-building capacity earlier in the pipeline (e.g. AGISF, ML4G, etc., though I expect those programs to mostly be doing differentially quite well and not to be the main bottlenecks).
So why not that funding model? Mostly a money thing.
*seems like mostly a funding deployment issue, probably due to some structural problems, AFAICT, without having any great inside info (within the traditional AI safety funding space; the rest of the world seems much less on the ball than the traditional AI safety funding space).
I don't understand what you mean. Do you mean there is lots of potential funding for AI alignment in eg governments, but that funding is only going to university researchers?
No, I mean that EA + AI safety funders probably would have a lot of money earmarked for AI risk mitigation, but they don't seem able/willing to deploy it fast enough (according to my timelines, at least, but probably also according to many of theirs).
Governments mostly just don't seem on the ball almost at all w.r.t. AI, even despite the recent progress (e.g. the AI safety summits, establishment of AISIs, etc.).
But legibility is a separate issue. If there are people who would potentially be good safety researchers, but they get turned away by recruiters because they don't have a legibly impressive resume, then you have the companies lacking employees they would do well to have.
So, companies could be less constrained on people if they were more thorough in evaluating people on more than shallow easily-legible qualities.
Spending more money on this recruitment evaluation would thus help alleviate lack of good researchers. So money is tied into person-shortage in this additional way.
I agree that suboptimal recruiting/hiring also causes issues, but it isn't easy to solve this problem with money.
Here's my recommendation for solving this problem with money: have paid 1-2 month work trials for applicants. The person you hire to oversee these doesn't have to be super-competent themselves; they're mostly a people-ops person coordinating the work-trialers. The outputs of the work could be relatively easily judged with just a bit of work from the candidate team (a validation-easier-than-production situation), and the physical co-location would give ample time for watercooler conversations to reveal culture-fit.
Here's another suggestion: how about telling the recruiters to spend the time to check personal references? This is rarely, if ever, done in my experience.
I'm pretty sure Ryan is rejecting the claim that the people hiring for the roles in question are worse-than-average at detecting illegible talent.
I will take Zvi's takeaways from his experience in this round of SFF grants as significant outside-view evidence for my inside view of the field.
Putting this short rant here for no particularly good reason but I dislike that people claim constraints here or there in a way where I guess their intended meaning is only that "the derivative with respect to that input is higher than for the other inputs".
On factory floors there exist hard constraints, the throughput is limited by the slowest machine (when everything has to go through this). The AI Safety world is obviously not like that. Increase funding and more work gets done, increase talent and more work gets done. None are hard constraints.
If I'm right that people are really only claiming the weak version, then I'd like to see somewhat more backing to their claims, especially if you say "definitely". Since none are constraints, the derivatives could plausibly be really close to one another. In fact, they kind of have to be, because there are smart optimizers who are deciding where to spend their funding and trying to actively manage the proportion of money sent to field building (getting more talent) vs direct work.
There is not a difference between the two situations in the way you're claiming, and indeed the differentiation point of view is used fruitfully on both factory floors and in more complex convex optimization problems. For example, see the connection between dual variables and their indication of how slack or taut constraints are in convex optimization, and how this can be interpreted as a relative tradeoff price between each of the constrained resources.
In your factory floor example, the constraints would be the throughput of each machine, and (assuming you're trying to maximize the throughput of the entire process) the dual variables would be zero everywhere except at the constraining machine, where it is the negative derivative of the throughput of the entire process with respect to the throughput of that machine. We could indeed determine that the tight constraint is the throughput of the relevant machine by looking at this derivative, which is significantly greater than all the others.
Practical problems often have a similarly sparse structure to their constraining inputs too, but just because not every dual variable except one is exactly zero doesn't mean the constraints with non-zero dual variables are secretly not actually constraining, or that it's unprincipled to use the same math or intuitions to reason about both situations.
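A rough numerical sketch of the comparison above, under loud assumptions: the factory is modeled as a serial line whose throughput is the minimum of the machine capacities, and the "field" is given a made-up Cobb-Douglas production function in funding and talent. Finite-difference sensitivities stand in for the dual variables here; this is not an actual dual solve, and all function names and numbers are hypothetical.

```python
def factory_throughput(capacities):
    """Serial line: total output is limited by the slowest machine."""
    return min(capacities)

def field_output(inputs):
    """Hypothetical smooth production function (Cobb-Douglas form), purely illustrative."""
    funding, talent = inputs
    return funding ** 0.5 * talent ** 0.5

def sensitivities(f, x, eps=1e-6):
    """Forward-difference derivative of f with respect to each input in x."""
    base = f(x)
    grads = []
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] += eps  # relax one input slightly, hold the others fixed
        grads.append((f(bumped) - base) / eps)
    return grads

# Hard-constraint case: only the bottleneck machine (capacity 4.0) has a non-zero sensitivity.
factory = sensitivities(factory_throughput, [10.0, 4.0, 7.0])

# Soft case: both inputs have non-zero sensitivities, and their ratio acts as a relative price.
field = sensitivities(field_output, [4.0, 9.0])

print([round(g, 3) for g in factory])  # approximately [0.0, 1.0, 0.0]
print([round(g, 3) for g in field])    # approximately [0.75, 0.333]
```

In the factory case the sensitivities are sparse, while in the made-up smooth case they are both non-zero but unequal, so the same derivative-based reading of "which input is most constraining" applies to both.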
you have to spend several years resume-building before painstakingly convincing people you're worth hiring for paid work
For government roles, I think "years of experience" is definitely an important factor. But I don't think you need to have been specializing for government roles specifically.
Especially for AI policy, there are several programs that are basically like "hey, if you have AI expertise but no background in policy, we want your help." To be clear, these are often still fairly competitive, but I think it's much more about being generally capable/competent and less about having optimized your resume for policy roles.
Proposal: a react for 'took feedback well' or similar, to socially reward people for being receptive to criticism
Something that sounds patronizing is not a social reward. It's not necessarily possible to formulate in a way that avoids this problem, without doing something significantly indirect. Right now this is upvoting for unspecified reasons.
OpenAI is partnering with Anduril to develop models for aerial defense: https://www.anduril.com/article/anduril-partners-with-openai-to-advance-u-s-artificial-intelligence-leadership-and-protect-u-s/
The 3 most important paragraphs, extracted to save readers the trouble of clicking on a link:
The Anduril and OpenAI strategic partnership will focus on improving the nation’s counter-unmanned aircraft systems (CUAS) and their ability to detect, assess and respond to potentially lethal aerial threats in real-time.
[...]
The accelerating race between the United States and China to lead the world in advancing AI makes this a pivotal moment. If the United States cedes ground, we risk losing the technological edge that has underpinned our national security for decades.
[...]
These models, which will be trained on Anduril’s industry-leading library of data on CUAS threats and operations, will help protect U.S. and allied military personnel and ensure mission success.
The era of AGI means humans can no longer afford to live in a world of militarily competing nations. Whatever slim hope there might be for alignment and AI not-kill-everyone is sunk by militaries trying to out-compete each other in development of creatively malevolent and at least somewhat unaligned martial AI. At minimum we can't afford non-democratic or theocratically ruled nations, or even nations with unaccountable power-unto-themselves military, intelligence or science bureaucracies to control nukes, pathogen building biolabs or AGI. It will be necessary to enforce this even at the cost of war.
I'm against intuitive terminology [epistemic status: 60%] because it creates the illusion of transparency; opaque terms make it clear you're missing something, but if you already have an intuitive definition that differs from the author's it's easy to substitute yours in without realizing you've misunderstood.
I agree. This is unfortunately often done in various fields of research where familiar terms are reused as technical terms.
For example, in ordinary language "organic" means "of biological origin", while in chemistry "organic" describes a type of carbon compound. Those two definitions mostly coincide on Earth (most such compounds are of biological origin), but when astronomers announce they have found "organic" material on an asteroid this leads to confusion.
How often is signalling a high degree of precision, without the reader understanding the meaning of the term, more important than conveying an imprecise but broadly on-topic understanding of the content?
I'm not alexithymic; I directly experience my emotions and have, additionally, introspective access to my preferences. However, some things manifest directly as preferences which I have been shocked to realize in my old age, were in fact emotions all along. (In rare cases these are stronger than the ones directly-felt even, despite reliably seeming on initial inspection to be simply neutral metadata).
Specific examples would be nice. Not sure if I understand correctly, but I imagine something like this:
You always choose A over B. You have been doing it for such a long time that you forgot why. Without reflecting about this directly, it just seems like there probably is a rational reason or something. But recently, either accidentally or by experiment, you chose B... and realized that experiencing B (or expecting to experience B) creates unpleasant emotions. So now you know that the emotions were the real cause of choosing A over B all that time.
(This is probably wrong, but hey, people say that the best way to elicit an answer is to provide a wrong one.)
Here's an example for you: I used to turn the faucet on while going to the bathroom, thinking it was simply due to a preference for somewhat masking the sound of my elimination habits from my housemates. Then one day I walked into the bathroom listening to something-or-other via earphones and forgot to turn the faucet on, only to realize about halfway through that apparently I didn't actually much care about such masking. Previously, being able to hear myself just seemed to trigger some minor anxiety about it that I'd failed to recognize, though its absence was indeed quite recognizable: no aural self-perception, no further problem (except for a brief bit of disorientation from the mental whiplash of being suddenly confronted with the reality that in a small way I wasn't actually quite the person I thought I was), not even now on the rare occasion that I do end up thinking about such things mid-elimination anyway.
Classic type of argument-gone-wrong (also IMO a way autistic 'hyperliteralism' or 'over-concreteness' can look in practice, though I expect that isn't always what's behind it): Ashton makes a meta-level point X based on Birch's meta point Y about object-level subject matter Z. Ashton thinks the topic of conversation is Y and Z is only relevant as the jumping-off point that sparked it, while Birch wanted to discuss Z and sees X as only relevant insofar as it pertains to Z. Birch explains that X is incorrect with respect to Z; Ashton, frustrated, reiterates that Y is incorrect with respect to X. This can proceed for quite some time with each feeling as though the other has dragged a sensible discussion onto their irrelevant pet issue; Ashton sees Birch's continual returns to Z as a gotcha distracting from the meta-level topic XY, whilst Birch in turn sees Ashton's focus on the meta-level point as sophistry to avoid addressing the object-level topic YZ. It feels almost exactly the same to be on either side of this, so misunderstandings like this are difficult to detect or resolve while involved in one.
Meta/object level is one possible mixup but it doesn't need to be that. Alternative example, is/ought: Cedar objects to thing Y. Dusk explains that it happens because Z. Cedar reiterates that it shouldn't happen, Dusk clarifies that in fact it is the natural outcome of Z, and we're off once more.
I think the ego is, essentially, the social model of the self. One's sense of identity is attached to it (effectively rendering it also the Cartesian homunculus), which is why ego death feels so scary to people, but (in most cases; I further theorize that people who developed their self-conceptions top-down, being likelier to have formed a self-model at odds with reality, are worse-affected here) the traits which make up the self-model's personality aren't stored in the model; it's merely a lossy description thereof and will rearise with approximately the same traits if disrupted.
I fully expect LLMs to hit a wall (if not now then in the future), but for any specific claims about timing, it's worth remembering that people frequently claim it's happening soon/has already happened, and will be wrong every time but one. Some past examples:
Facebook's then-head of AI, December 2019: https://www.wired.com/story/facebooks-ai-says-field-hit-wall/
Gary Marcus, March 2022: https://nautil.us/deep-learning-is-hitting-a-wall-238440/
Donald Hobson, August 2022: https://www.lesswrong.com/posts/gqqhYijxcKAtuAFjL/a-data-limited-future
Epoch AI, November 2022 (estimating high-quality language data exhausted by 2024; in 2024 they updated their projection to 2028): https://epoch.ai/blog/will-we-run-out-of-ml-data-evidence-from-projecting-dataset
Will Eden, February 2023 (thread): https://x.com/WilliamAEden/status/1630690003830599680
Sam Altman, April 2023: https://www.wired.com/story/openai-ceo-sam-altman-the-age-of-giant-ai-models-is-already-over/
Justis Mills, June 2024: https://www.lesswrong.com/posts/axjb7tN9X2Mx4HzPz/the-data-wall-is-important (he cites Leopold Aschenbrenner for more detail, but Aschenbrenner himself is optimistic so I didn't link directly)
I more-or-less endorse the model described in larger language models may disappoint you [or, an eternally unfinished draft], and moreover I think language is an inherently lossy instrument such that the minimally-lossy model won't have perfectly learned the causal processes or whatever behind its production.
Necessary-but-not-sufficient condition for a convincing demonstration of LLM consciousness: a prompt which does not allude to LLMs, consciousness, or selfhood in any way.
I'm always much more interested in "conditional on an LLM being conscious, what would we be able to infer about what it's like to be it?" than the process of establishing the basic fact. This is related to me thinking there's a straightforward thing-it's-like-to-be a dog, duck, plant, light bulb, bacteria, internet router, fire, etc... if it interacts, then there's a subjective experience of the interaction in the interacting physical elements. Panpsychism of hard problem, compute dependence of easy problem. If one already holds this belief, then no LLM-specific evidence is needed to establish hard problem, and understanding the flavor of the easy problem is the interesting part.
You also would not be able to infer anything about its experience because the text it outputs is controlled by the prompt.
I am now convinced. In order to investigate, one must have some way besides prompts to do it. Something to do with the Golden Gate Bridge, perhaps? Seems like more stuff like that could be promising. Since I'm starting from the assumption that it's likely, I'd want to check their consent first.