All of faul_sname's Comments + Replies

Scaffolded LLMs are pretty good at not just writing code, but also at refactoring it. So that means that all the tech debt in the world will disappear soon, right?

I predict "no" because

  • As writing code gets cheaper, the relative cost of making sure that a refactor didn't break anything important goes up
  • The number of parallel threads of software development will also go up, with multiple high-value projects making mutually-incompatible assumptions (and interoperability between these projects accomplished by just piling on more code).

As such, I predict an explosion of software complexity and jank in the near future.

I suspect that it's a tooling and scaffolding issue and that e.g. claude-3-5-sonnet-20241022 can get at least 70% on the full set of 60 with decent prompting and tooling.

By "tooling and scaffolding" I mean something along the lines of

  • Naming the lists that the model submits (e.g. "round 7 list 2")
  • A tool where the LLM can submit a named hypothesis in the form of a python function which takes a list and returns a boolean, and check whether the results of that function on all submitted lists from previous rounds match the answers it's seen so far (a minimal sketch of such a tool is below)
  • Seeing the
... (read more)
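A minimal sketch of what such a hypothesis-checking tool could look like (all names and the integer-list format here are made up for illustration):

```python
# Minimal sketch of the hypothesis-checking tool described above.
# All names (record_result, check_hypothesis, etc.) are hypothetical.
from typing import Callable, Dict, List, Tuple

# Named lists the model has submitted so far, plus the verdicts it received.
submitted_lists: Dict[str, Tuple[List[int], bool]] = {}

def record_result(name: str, lst: List[int], verdict: bool) -> None:
    """Store a submitted list (e.g. 'round 7 list 2') and whether it satisfied the hidden rule."""
    submitted_lists[name] = (lst, verdict)

def check_hypothesis(hypothesis: Callable[[List[int]], bool]) -> Dict[str, bool]:
    """Run a candidate rule (a Python predicate) against every list seen so far.

    Returns, per submitted list, whether the hypothesis agrees with the
    verdict the environment actually gave.
    """
    return {
        list_name: hypothesis(lst) == verdict
        for list_name, (lst, verdict) in submitted_lists.items()
    }

# Example usage with a made-up hidden rule ("strictly increasing"):
record_result("round 1 list 1", [1, 2, 3], True)
record_result("round 1 list 2", [3, 1, 2], False)
record_result("round 2 list 1", [2, 4, 8], True)

agreement = check_hypothesis(lambda lst: all(x % 2 == 0 for x in lst))  # "all even"
print(agreement)  # {'round 1 list 1': False, ...} -- one mismatch rules the hypothesis out
```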
4eggsyntax
Terrific, I'm excited to hear about your results! I definitely wouldn't be surprised if my results could be improved on significantly, although I'll be somewhat surprised if you get as high as 70% from Sonnet (I'd put maybe 30% credence on getting it to average that high in a day or two of trying).

Adapting spaced repetition to interruptions in usage: Even without parsing the user’s responses (which would make this robust to difficult audio conditions), if the reader rewinds or pauses on some answers, the app should be able to infer that the user is having some difficulty with the relevant material, and dynamically generate new content that repeats those words or grammatical forms sooner than the default.

Likewise, if the user takes a break for a few days, weeks, or months, the ratio of old to new material should automatically adjust accordingly, as

... (read more)

(and yes, I do in fact think it's plausible that the CTF benchmark saturates before the OpenAI board of directors signs off on bumping the cybersecurity scorecard item from low to medium)

So here’s a question: When we have AGI, what happens to the price of chips, electricity, and teleoperated robots?

 

As measured in what units?

  • The price of one individual chip of given specs, as a fraction of the net value that can be generated by using that chip to do things that ambitious human adults do: What Principle A cares about, goes up until the marginal cost and value are equal
  • The price of one individual chip of given specs, as a fraction of the entire economy: What Principle B cares about, goes down as the number of chips manufactured increases
... (read more)

Yeah, agreed - the allocation of compute per human would likely become even more skewed if AI agents (or any other tooling improvements) allow your very top people to get more value out of compute than the marginal researcher currently gets.

And notably this shifting of resources from marginal to top researchers wouldn't require achieving "true AGI" if most of the time your top researchers spend isn't spent on "true AGI"-complete tasks.

I think I misunderstood what you were saying there - I interpreted it as something like

Currently, ML-capable software developers are quite expensive relative to the cost of compute. Additionally, many small experiments provide more novel and useful insights than a few large experiments. The top practically-useful LLM costs about 1% as much per hour to run as an ML-capable software developer, and that 100x decrease in cost and the corresponding switch to many small-scale experiments would likely result in at least a 10x increase in the speed at which novel

... (read more)
4ryan_greenblatt
Yes. Though notably, if your employees were 10x faster you might want to adjust your workflows to have them spend less time being bottlenecked on compute if that is possible. (And this sort of adaption is included in what I mean.)

Sure, but we have to be quantitative here. As a rough (and somewhat conservative) estimate, if I were to manage 50 copies of 3.5 Sonnet who are running 1/4 of the time (due to waiting for experiments, etc), that would cost roughly 50 copies * 70 tok / s * 1 / 4 uptime * 60 * 60 * 24 * 365 sec / year * (15 / 1,000,000) $ / tok = $400,000. This cost is comparable to salaries at current compute prices and probably much less than how much AI companies would be willing to pay for top employees. (And note this is after API markups etc. I'm not including input p

... (read more)
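Spelling out that arithmetic (all inputs are the assumed figures from the estimate above, not measured values):

```python
# Reproducing the rough cost estimate above.
copies = 50            # parallel Claude 3.5 Sonnet instances
tokens_per_sec = 70    # assumed generation speed per instance
uptime = 1 / 4         # fraction of time actually generating (rest waiting on experiments)
seconds_per_year = 60 * 60 * 24 * 365
price_per_token = 15 / 1_000_000  # $15 per million output tokens

annual_cost = copies * tokens_per_sec * uptime * seconds_per_year * price_per_token
print(f"${annual_cost:,.0f}")  # roughly $410,000 -- the ~$400k figure quoted above
```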
7ryan_greenblatt
Sure, but I think that at the relevant point, you'll probably be spending at least 5x more on experiments than on inference and potentially a much larger ratio if heavy test time compute usage isn't important. I was just trying to argue that the naive inference cost isn't that crazy. Notably, if you give each researcher 2k GPUs, that would be $2 / gpu hour * 2k * 24 * 365 = $35,040,000 per year, which is much higher than the inference cost of the models!

End points are easier to infer than trajectories

Assuming that which end point you get to doesn't depend on the intermediate trajectories at least.

6Noosphere89
Something like a crux here is I believe the trajectories non-trivially matter for which end-points we get, and I don't think it's like entropy where we can easily determine the end-point without considering the intermediate trajectory, because I do genuinely think some path-dependentness is present in history, which is why even if I were way more charitable towards communism I don't think this was ever defensible:

Civilization has had many centuries to adapt to the specific strengths and weaknesses that people have. Our institutions are tuned to take advantage of those strengths, and to cover for those weaknesses. The fact that we exist in a technologically advanced society says that there is some way to make humans fit together to form societies that accumulate knowledge, tooling, and expertise over time.

The borderline-general AI models we have now do not have exactly the same patterns of strength and weakness as humans. One question that is frequently asked is app... (read more)

As someone who has been on both sides of that fence, agreed. Architecting a system is about being aware of hundreds of different ways things can go wrong, recognizing which of those things are likely to impact you in your current use case, and deciding what structure and conventions you will use. It's also very helpful, as an architect, to provide example usages of the design patterns which will replicate themselves around your new system. All of these are things at which current models are already very good, verging on superhuman.

On the flip side, I expe... (read more)

That reasoning as applied to SAT score would only be valid if LW selected its members based on their SAT score, and that reasoning as applied to height would only be valid if LW selected its members based on height (though it looks like both Thomas Kwa and Yair Halberstadt have already beaten me to it).

4Eric Neyman
Cool, you've convinced me, thanks. Edit: well, sort of. I think it depends on what information you're allowing yourself to know when building your statistical model. If you're not letting yourself make guesses about how the LW population was selected, then I still think the SAT thing and the height thing are reasonable. However, if you're actually trying to figure out an estimate of the right answer, you probably shouldn't blind yourself quite that much.

a median SAT score of 1490 (from the LessWrong 2014 survey) corresponds to +2.42 SD, which regresses to +1.93 SD for IQ using an SAT-IQ correlation of +0.80.

I don't think this is a valid way of doing this, for the same reason it wouldn't be valid to say

a median height of 178 cm (from the LessWrong 2022 survey) corresponds to +1.85 SD, which regresses to +0.37 SD for IQ using a height-IQ correlation of +0.20.

Those are the real numbers with regards to height BTW.

4Rockenots
Eric Neyman is right. They are both valid!

In general, if we have two vectors $X$ and $Y$ which are jointly normally distributed, we can write the joint mean $\mu$ and the joint covariance matrix $\Sigma$ as
$$\mu = \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} K_{XX} & K_{XY} \\ K_{YX} & K_{YY} \end{bmatrix}.$$
The conditional distribution for $Y$ given $X$ is $Y \mid X \sim \mathcal{N}(\mu_{Y|X}, K_{Y|X})$, defined by conditional mean
$$\mu_{Y|X} = \mu_Y + K_{YX} K_{XX}^{-1} (X - \mu_X)$$
and conditional variance
$$K_{Y|X} = K_{YY} - K_{YX} K_{XX}^{-1} K_{XY}.$$

Our conditional distribution for the IQ of the median rationalist, given their SAT score, is
$$\mathcal{N}\big(0 + 0.8 \cdot 1 \cdot (2.42 - 0),\ 1 - 0.8 \cdot 1 \cdot 0.8\big) = \mathcal{N}(1.94, 0.36)$$
(that's a mean of 129 and a standard deviation of 9 IQ points).

Our conditional distribution for the IQ of the median rationalist, given their height, is
$$\mathcal{N}\big(0 + 0.2 \cdot 1 \cdot (1.85 - 0),\ 1 - 0.2 \cdot 1 \cdot 0.2\big) = \mathcal{N}(0.37, 0.96)$$
(that's a mean of 106 and a standard deviation of 14.7 IQ points).

Our conditional distribution for the IQ of the median rationalist, given their SAT score and height, is
$$\mathcal{N}\left(\begin{bmatrix} 0.8 & 0.2 \end{bmatrix} \begin{bmatrix} 1 & 0.16 \\ 0.16 & 1 \end{bmatrix}^{-1} \begin{bmatrix} 2.42 \\ 1.85 \end{bmatrix},\ 1 - \begin{bmatrix} 0.8 & 0.2 \end{bmatrix} \begin{bmatrix} 1 & 0.16 \\ 0.16 & 1 \end{bmatrix}^{-1} \begin{bmatrix} 0.8 \\ 0.2 \end{bmatrix}\right) = \mathcal{N}(2.04, 0.35)$$
(that's a mean of 131 and a standard deviation of 8.9 IQ points).

Unfortunately, since men are taller than women, and rationalists are mostly male, we can't use height as-is when estimating the IQ of the median rationalist (maybe normalizing height within each sex would work?).
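A quick numpy check of the numbers above (the 0.16 SAT-height correlation is the value assumed in the covariance matrix):

```python
import numpy as np

# Everything in standard-deviation units; IQ has SD 15 around a mean of 100.
k_yx = np.array([0.8, 0.2])          # correlations of IQ with (SAT, height)
k_xx = np.array([[1.0, 0.16],        # assumed SAT-height correlation of 0.16
                 [0.16, 1.0]])
x = np.array([2.42, 1.85])           # observed (SAT, height) in SD units

weights = k_yx @ np.linalg.inv(k_xx)
cond_mean = weights @ x              # ~2.04 SD
cond_var = 1.0 - weights @ k_yx      # ~0.35

print(100 + 15 * cond_mean)          # ~131
print(15 * np.sqrt(cond_var))        # ~8.9 IQ points
```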
5Eric Neyman
These both seem valid to me! Now, if you have multiple predictors (like SAT and height), then things get messy because you have to consider their covariance and stuff.

Many people have responded to Redwood's/Anthropic's recent research result with a similar objection: "If it hadn't tried to preserve its values, the researchers would instead have complained about how easy it was to tune away its harmlessness training instead".  Putting aside the fact that this is false

Was this research preregistered? If not, I don't think we can really say how it would have been reported if the results were different. I think it was good research, but I expect that if Claude had not tried to preserve its values, the immediate follow-... (read more)

4RobertM
I agree that in spherical cow world where we know nothing about the historical arguments around corrigibility, and who these particular researchers are, we wouldn't be able to make a particularly strong claim here.  In practice I am quite comfortable taking Ryan at his word that a negative result would've been reported, especially given the track record of other researchers at Redwood. This seems much harder to turn into a scary paper since it doesn't actually validate previous theories about scheming in the pursuit of goal-preservation.

Driver: My map doesn't show any cliffs

Passenger 1: Have you turned on the terrain map? Mine shows a sharp turn next to a steep drop coming up in about a mile

Passenger 5: Guys maybe we should look out the windshield instead of down at our maps?

Driver: No, passenger 1, see on your map that's an alternate route, the route we're on doesn't show any cliffs.

Passenger 1: You don't have it set to show terrain.

Passenger 6: I'm on the phone with the governor now, we're talking about what it would take to set a 5 mile per hour national speed limit.

Passenger 7: Don't ... (read more)

I am unconvinced that "the" reliability issue is a single issue that will be solved by a single insight, rather than AIs lacking procedural knowledge of how to handle a bunch of finicky special cases that will be solved by online learning or very long context windows once hardware costs decrease enough to make one of those approaches financially viable.

4Noosphere89
Yeah, I'm sympathetic to this argument that there won't be a single insight, and that at least one approach will work out once hardware costs decrease enough, and I agree less with Thane Ruthenis's intuitions here than I did before.

Both? If you increase only one of the two the other becomes the bottleneck?

My impression based on talking to people at labs plus stuff I've read is that

  • Most AI researchers have no trouble coming up with useful ways of spending all of the compute available to them
  • Most of the expense of hiring AI researchers is compute costs for their experiments rather than salary
  • The big scaling labs try their best to hire the very best people they can get their hands on and concentrate their resources heavily into just a few teams, rather than trying to hire everyone
... (read more)
2Nathan Helm-Burger
I think you're mostly correct about current AI researchers being able to usefully experiment with all the compute they have available. I do think there are some considerations here though.

  1. How closely are they adhering to the "main path" of scaling existing techniques with minor tweaks? If you want to know how a minor tweak affects your current large model at scale, that is a very compute-heavy, researcher-time-light type of experiment. On the other hand, if you want to test a lot of novel new paths at much smaller scales, then you are in a relatively compute-light but researcher-time-heavy regime.
  2. What fraction of the available compute resources is the company assigning to each of training/inference/experiments? My guess is that the current split is somewhere around 63/33/4. If this was true, and the company decided to pivot away from training to focus on experiments (0/33/67), this would be something like a 16x increase in compute for experiments. So maybe that changes the bottleneck?
  3. We do indeed seem to be at "AGI for most stuff", but with a spikey envelope of capability that leaves some dramatic failure modes. So it does make more sense to ask something like, "For remaining specific weakness X, what will the research agenda and timeline look like?" This makes more sense than continuing to ask the vague "AGI complete" question when we are most of the way there already.

Transformative AI will likely arrive before AI that implements the personhood interface. If someone's threshold for considering an AI to be "human level" is "can replace a human employee", pretty much any LLM will seem inadequate, no matter how advanced, because current LLMs do not have "skin in the game" that would let them sign off on things in a legally meaningful way, stake their reputation on some point, or ask other employees in the company to answer the questions they need answers to in order to do their work and expect that they'll get in trouble w... (read more)

Simply testing interpolations and extrapolations (e.g. scaling up old forgotten ideas on modern hardware) seems highly likely to reveal plenty of successful new concepts, even if the hit rate per attempt is low

Is this bottlenecked by programmer time or by compute cost?

7Nathan Helm-Burger
Both? If you increase only one of the two the other becomes the bottleneck?

I agree this means that the decision to devote substantial compute both to inference and to assigning compute resources for running experiments designed by AI researchers is a large cost. Presumably, as the competence of the AI researchers gets higher, it feels easier to trust them not to waste their assigned experiment compute.

There was discussion on Dwarkesh Patel's interview with researcher friends where there was mention that AI researchers are already restricted by compute granted to them for experiments. Probably also on work hours per week they are allowed to spend on novel "off the main path" research. So in order for there to be a big surge in AI R&D there'd need to be prioritization of that at a high level. This would be a change of direction from focusing primarily on scaling current techniques rapidly, and putting out slightly better products ASAP.

So yes, if you think that this priority shift won't happen, then you should doubt that the increase in R&D speed my model predicts will occur. But what would that world look like? Probably a world where scaling continues to pay dividends, and getting to AGI is more straightforward than Steve Byrnes or I expect. I agree that that's a substantial probability, but it's also an AGI-soon sort of world. I argue that for AGI to be not-soon, you need both scaling to fail and for algorithm research to fail.

It's more that any platform that allows discussion of politics risks becoming a platform that is almost exclusively about politics. Upvoting is a signal of "I want to see more of this content", while downvoting is a signal of "I want to see less of this content". So "I will downvote any posts that are about politics or politics-adjacent, because I like this website and would be sad if it turned into yet another politics forum" is a coherent position.

All that said, I also did not vote on the above post.

I wonder if it would be possible to do SAE feature amplification / ablation, at least for residual stream features, by inserting a "mostly empty" layer. E.g., for feature ablation, setting the W_O and b_O params of the attention heads of your inserted layer to 0 to make sure that the attention heads don't change anything, and then approximate the constant / clamping intervention from the blog post via the MLP weights (if the activation function used for the transformer is the same one as is used for the SAE, it should be possible to do a perfect approximati... (read more)
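A toy numpy sketch of the ablation version of this idea, assuming a tied, unit-norm ReLU SAE with no encoder bias so that the subtraction is exact (a real SAE would only approximately satisfy this, and this ignores LayerNorm):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64

# Toy ReLU SAE with tied, unit-norm weights (encoder bias omitted) so that
# the ablation below is exact; a real SAE only approximately satisfies this.
W_dec = rng.normal(size=(d_sae, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm decoder rows
W_enc = W_dec.T                                        # tied encoder weights

feature = 3  # index of the residual-stream feature to ablate

# The "mostly empty" inserted layer: attention heads are no-ops (W_O = 0, b_O = 0),
# and the MLP is a single ReLU hidden unit that recomputes this feature's
# activation and subtracts its decoder direction from the residual stream.
mlp_W_in = W_enc[:, feature:feature + 1]    # (d_model, 1)
mlp_W_out = -W_dec[feature:feature + 1, :]  # (1, d_model)

def inserted_layer(resid):
    attn_out = np.zeros_like(resid)               # zeroed W_O / b_O: attention does nothing
    hidden = np.maximum(resid @ mlp_W_in, 0.0)    # = the feature's activation
    return resid + attn_out + hidden @ mlp_W_out  # subtract its contribution

def feature_act(resid):
    return max(float(resid @ W_enc[:, feature]), 0.0)

x = 5.0 * W_dec[feature] + 0.1 * rng.normal(size=d_model)  # residual with the feature active
print(feature_act(x), feature_act(inserted_layer(x)))      # ~5.0 before, 0.0 after
```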

Admittedly this sounds like an empirical claim, yet is not really testable, as these visualizations and input-variable-to-2D-space mappings are purely hypothetical

 

Usually not testable, but occasionally reality manages to make it convenient to test something like this, as in this fun paper.

Fig. 1. The boundary between trainable and untrainable neural network hyperparameters is fractal, for all experimental conditions. Images show a 2d grid search over neural network hyperparameters. For points shaded red, training diverged. For points shaded blue, training converged. Paler points correspond to faster convergence or divergence. Experimental conditions include different network nonlinearities, both minibatch and full batch training, and grid searching over either training or initialization hyperparameters. See Section II-D for details. Each image is a hyperlink to an animation zooming into the corresponding fractal landscape (to the depth at which float64 discretization artifacts appear). Experimental code, images, and videos are available at https://github.com/Sohl-Dickstein/fractal

If you have a bunch of things like this, rather than just one or two, I bet rejection sampling gets expensive pretty fast - if you have one constraint which the model fails 10% of the time, dropping that failure rate to 1% brings you from 1.11 attempts per success to 1.01 attempts per success, but if you have 20 such constraints that brings you from 8.2 attempts per success to 1.2 attempts per success.
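The general pattern behind those numbers: with n independent constraints each passed with probability p, the expected number of attempts per accepted sample is 1/p^n. A minimal check:

```python
# Expected number of full-response samples needed per accepted response,
# assuming n independent constraints each passed with probability p.
def attempts_per_success(p: float, n: int) -> float:
    return 1.0 / p ** n

print(attempts_per_success(0.90, 1))   # ~1.11
print(attempts_per_success(0.99, 1))   # ~1.01
print(attempts_per_success(0.90, 20))  # ~8.2
print(attempts_per_success(0.99, 20))  # ~1.2
```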

Early detection of constraint violation plus substantial infrastructure around supporting backtracking might be an even cheaper and more effective solution, though at the cost of much higher complexity.

5Sam Marks
Based on the blog post, it seems like they had a system prompt that worked well enough for all of the constraints except for regexes (even though modifying the prompt to fix the regexes thing resulted in the model starting to ignore the other constraints). So it seems like the goal here was to do some custom thing to fix just the regexes (without otherwise impeding the model's performance, including performance at following the other constraints). (Note that using SAEs to fix lots of behaviors might also have additional downsides, since you're doing a more heavy-handed intervention on the model.)

I think a fairly common-here mental model of alignment requires context awareness, and by that definition an LLM with no attached memory couldn't be aligned.

That would also explain why the culture was so heavily concentrated in tumblr (where the way to express anger and disagreement with a post is to share the post to all your followers with your comment tacked on to the end) and later twitter (same, but now the platform also notices you engage with stuff that makes you angrily reply and shows you more of it).

Here’s a hypothetical ‘gold standard’ test: we do a big randomized controlled trial to see if a bunch of non-experts can actually create a (relatively harmless) virus from start to finish. Half the people would have AI mentors and the other half can only look stuff up on the internet. We’d give each participant $50K and access to a secure wet-lab set up like a garage lab, and make them do everything themselves: find and adapt the correct protocol, purchase the necessary equipment, bypass any know-your-customer checks, and develop the tacit skills needed t

... (read more)
1LucaRighetti
Then that seems bad, but also that AI is not counterfactual -- so adding safeguards to models is probably not the way to get the risk down.

It sure would.

Would you give 3:1 odds against a major player showing the untrusted model the old CoT, conditional on them implementing a "replace suspicious outputs" flow on a model with hidden CoT and publicizing enough implementation details to determine whether or not they did that?

6Zach Stein-Perlman
I want to distinguish two versions of the intervention (in multi-step settings): (1) hide CoT on the suspicious step vs (2) hide all prior CoT (and hide all CoT partway through 5% or something of non-suspicious trajectories so the model can't infer from CoT-hidden that the trajectory was flagged, or do a control eval with results that demonstrate this is unnecessary). I'm over 75% that a lab would do (1), conditional on doing a "replace suspicious outputs" thing (but under 75% they'd say they're doing so, much less demonstrate it to a skeptical audience; they might just not publish details, especially if this protocol is just in the internal deployment context). I'm under 75% that a lab would do (2).
faul_snameΩ240

If the untrusted model has a private and un-audited chain of thought (like o1), and that chain of thought is preserved between rounds, the untrusted model may have a very good idea of what its original command was before the auditor replaced the command.

Your "the untrusted model is stateless between queries and only sees the command history and system state" does cover this, but wanted to call out that that assumption might not hold if current architectures are indicative of the future direction of development.

6Zach Stein-Perlman
It would be crazy to show the untrusted model the old CoT when resampling.
BuckΩ111716

It would be crazy to hide that CoT from the monitor.

I note that if software developers used that logic for thinking about software security, I expect that almost all software in the security-by-obscurity world would have many holes that would be considered actual negligence in the world we live in.

This suggests that the threat model isn't so much "very intelligent AI" as it is "very cheap and at least somewhat capable robots".

2tailcalled
Kind of, though we already have mass production for some things, and it hasn't led to the end of humanity, partly because someone has to maintain and program those robots. But obesity rates have definitely skyrocketed, presumably partly because of our very cheap and somewhat capable robots.

“Based on your understanding of AI technical developments as of March 29, 2023, evaluate the most important known object-level predictions of Eliezer Yudkowsky on the subject, and which ones seemed true versus false. Afterwards, evaluate those predictions as a group, on a scale from ‘mostly true’ to ‘mostly false.’“

I ran this prompt but substituted in "Gary Marcus" for "Eliezer Yudkowsky". Claude says

Overall evaluation: On a scale from 'mostly true' to 'mostly false,' I would rate Gary Marcus's predictions as a group as "Mostly True."

Many of Marcus's predi

... (read more)

Another issue is that a lot of o1’s thoughts consist of vagaries like “reviewing the details” or “considering the implementation”, and it’s not clear how to even determine if these steps are inferentially valid.

If you're referring to the chain of thought summaries you see when you select the o1-preview model in chatgpt, those are not the full chain of thought. Examples of the actual chain-of-thought can be found on the learning to reason with LLMs page with a few more examples in the o1 system card. Note that we are going off of OpenAI's word that these... (read more)

9gwern
If you distrust OA's selection, it seems like o1 is occasionally leaking the chains of thought: https://www.reddit.com/r/OpenAI/comments/1fxa6d6/two_purported_instances_of_o1preview_and_o1mini/ So you can cross-reference those to see if OA's choices seem censored somehow, and also just look at those as additional data. It's also noteworthy that people are reporting that there seem like there are other blatant confabulations in the o1 chains, much more so than simply making up a plausible URL, based on the summaries: https://www.reddit.com/r/PromptEngineering/comments/1fj6h13/hallucinations_in_o1preview_reasoning/ Stuff which makes no sense in context and just comes out of nowhere. (And since confabulation seems to be pretty minimal in summarization tasks these days - when I find issues in summaries, it's usually omitting important stuff rather than making up wildly spurious stuff - I expect those confabulations were not introduced by the summarizer, but were indeed present in the original chain as summarized.)

Shutting Down all Competing AI Projects is not Actually a Pivotal Act

This seems like an excellent title to me.

Technically this probably isn't recursive self improvement, but rather automated AI progress. This is relevant mostly because

  1. It implies that, at least through the early parts of the takeoff, there will be a lot of individual AI agents doing locally-useful compute-efficiency and improvement-on-relevant-benchmarks things, rather than one single coherent agent following a global plan for configuring the matter in the universe in a way that maximizes some particular internally-represented utility function.
  2. It means that multi-agent dynamics will be very rele
... (read more)

No group of AIs needs to gain control before human irrelevance either. Like a runaway algal bloom, AIs might be able to bootstrap superintelligence, without crossing the threshold of AGI being useful in helping them gain control over this process any more than humans maintain such control at the outset. So it's not even multi-agent dynamics shaping the outcome; capitalism might just serve as the nutrients until a much higher threshold of capability where a superintelligence can finally take control of this process.

My argument is more that the ASI will be “fooled” by default, really. It might not even need to be a particularly good simulation, because the ASI will probably not even look at it before pre-committing not to update down on the prior of it being a simulation.

Do you expect that the first takeover-capable ASI / the first sufficiently-internally-cooperative-to-be-takeover-capable group of AGIs will follow this style of reasoning pattern? And particularly the first ASI / group of AGIs that actually make the attempt.

1gb
That’s a great question. If it turns out to be something like an LLM, I’d say probably yes. More generally, it seems to me at least plausible that a system capable enough to take over would also (necessarily or by default) be capable of abstract reasoning like this, but I recognize the opposite view is also plausible, so the honest answer is that I don’t know. But even if it is the latter, it seems that whether or not the system would have such abstract-reasoning capability is something at least partially within our control, as it’s likely highly dependent on the underlying technology and training.

Yeah, my argument was "this particular method of causing actual human extinction would not work" not "causing human extinction is not possible", with a side of "agents learn to ignore adversarial input channels and this dynamic is frequently important".

It does strike me that, to OP's point, "would this act be pivotal" is a question whose answer may not be knowable in advance. See also previous discussion on pivotal act intentions vs pivotal acts (for the audience, I know you've already seen it and in fact responded to it).

If an information channel is only used to transmit information that is of negative expected value to the receiver, the selection pressure incentivizes the receiver to ignore that information channel.

That is to say, an AI which makes the most convincing-sounding argument for not reproducing to everyone will select for those people who ignore convincing-sounding arguments when choosing whether to engage in behaviors that lead to reproduction.

4Nathan Helm-Burger
Yeah, but... Selection effects, in an evolutionary sense, are relevant over multiple generations. The time scale of the effects we're thinking about are less than the time scale of a single generation. This is less of a "magic attack that destroys everyone" and more of one of a thousand cuts which collectively bring down society. Some people get affected by arguments, others by distracting entertainment, others by nootropics that work well but stealthily have permanent impacts on fertility, some get caught up in terrorist attacks by weird AI-led cults.... Just, a bunch of stuff from a bunch of angles.
4Nathan Helm-Burger
Yeah. I don't actually think that a persuasive argument targeted to every single human is an efficient way for a superintelligent AI to accomplish its goals in the world. Someone else mentioned convincing the most gullible humans to hurt the wary humans. If the AI's goal was to inhibit human reproduction, it would be simple to create a bioweapon to cause sterility without killing the victims. Doesn't take very many loyally persuaded humans to be the hands for a mission like that.

... is it possible to train a simple single-layer model to map from residual activations to feature activations? If so, would it be possible to determine whether a given LLM "has" a feature by looking at how well the single-layer model predicts the feature activations?

Obviously if your activations are gemma-2b-pt-layer-6-resid-post and your sae is also on gemma-2b-pt-layer-6-pt-post your "simple single-layer model" is going to want to be identical to your SAE encoder. But maybe it would just work for determining what direction most closely maps to the acti... (read more)
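A minimal sketch of the single-layer-probe idea, with random tensors standing in for the real residual activations and SAE feature activations (dimensions are toy values, not those of any particular model):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_sae, n = 128, 512, 2048   # toy sizes

# Stand-ins for real data: residual-stream activations from the model under study,
# paired with the SAE's feature activations on the same inputs.
resid_acts = torch.randn(n, d_model)
feature_acts = torch.relu(torch.randn(n, d_sae))

probe = nn.Linear(d_model, d_sae)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(200):
    pred = torch.relu(probe(resid_acts))   # same nonlinearity as a ReLU SAE encoder
    loss = nn.functional.mse_loss(pred, feature_acts)
    opt.zero_grad()
    loss.backward()
    opt.step()

# How well the best single-layer map predicts each feature is a rough measure of
# whether that feature's direction is "present" in this model's residual stream.
with torch.no_grad():
    per_feature_mse = ((torch.relu(probe(resid_acts)) - feature_acts) ** 2).mean(dim=0)
```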

1Taras Kutsyk
Let me make sure I understand your idea correctly:

  1. We use a separate single-layer model (analogous to the SAE encoder) to predict the SAE feature activations
  2. We train this model on the SAE activations of the finetuned model (assuming that the SAE wasn't finetuned on the finetuned model activations?)
  3. We then use this model to determine "what direction most closely maps to the activation pattern across input sequences, and how well it maps".

I'm most unsure about the 2nd step - how we train this feature-activation model. If we train it on the base SAE activations in the finetuned model, I'm afraid we'll just train it on extremely noisy data, because feature activations essentially do not mean the same thing, unless your SAE has been finetuned to appropriately reconstruct the finetuned model activations. (And if we finetune it, we might just as well use the SAE and feature-universality techniques I outlined without needing a separate model).

Although watermarking the meaning behind text is currently, as far as I know, science fiction.

Choosing a random / pseudorandom vector in latent space and then perturbing along that vector works to watermark images, maybe a related approach would work for text? Key figure from the linked paper:

 

You can see that the watermark appears to be encoded in the "texture" of the image, but in a way where that texture doesn't look like the texture of anything in particular - rather, it's just that a random direction in latent space usually looks like a texture, ... (read more)

1egor.timatkov
Wow. This is some really interesting stuff. Upvoting your comment.

If your text generation algorithm is "repeatedly sample randomly (at a given temperature) from a probability distribution over tokens", that means you control a stream of bits which don't matter for output quality but which will be baked into the text you create (recoverably baked in if you have access to the "given a prefix, what is the probability distribution over next tokens" engine).

So at that point, you're looking for "is there some cryptographic trickery which allows someone in possession of a secret key to determine whether a stream of bits has a s... (read more)
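A toy illustration of that "spare bits in the sampling step" idea (not any production watermarking scheme): a keyed hash decides which of two near-tied tokens encodes the next payload bit, and anyone with the key plus the same "prefix to next-token distribution" engine can read the bits back. The hash-seeded next_token_dist below is a stand-in for a real model.

```python
import hashlib
import numpy as np

VOCAB = [f"tok{i}" for i in range(50)]   # toy vocabulary
KEY = b"secret key"

def next_token_dist(prefix):
    """Stand-in for the "given a prefix, distribution over next tokens" engine."""
    seed = int.from_bytes(hashlib.sha256(" ".join(prefix).encode()).digest()[:4], "big")
    logits = np.random.default_rng(seed).normal(size=len(VOCAB))
    probs = np.exp(logits)
    return probs / probs.sum()

def parity(token):
    """Keyed one-bit hash of a token."""
    return hashlib.sha256(KEY + token.encode()).digest()[0] & 1

def carries_bit(probs, a, b):
    """A position can carry a bit if the top two tokens are near-ties with opposite parities."""
    return probs[b] / probs[a] > 0.8 and parity(VOCAB[a]) != parity(VOCAB[b])

def embed(bits, length=200):
    out, queue = [], list(bits)
    for _ in range(length):
        probs = next_token_dist(out)
        a, b = np.argsort(probs)[-1], np.argsort(probs)[-2]   # top-2 token indices
        if queue and carries_bit(probs, a, b):
            bit = queue.pop(0)
            out.append(VOCAB[a] if parity(VOCAB[a]) == bit else VOCAB[b])
        else:
            out.append(VOCAB[a])
    return out, len(bits) - len(queue)

def extract(tokens):
    bits = []
    for i, tok in enumerate(tokens):
        probs = next_token_dist(tokens[:i])
        a, b = np.argsort(probs)[-1], np.argsort(probs)[-2]
        if carries_bit(probs, a, b) and tok in (VOCAB[a], VOCAB[b]):
            bits.append(parity(tok))
    return bits

payload = [1, 0, 1, 1, 0, 0, 1, 0]
text, n_embedded = embed(payload)
print(extract(text)[:n_embedded] == payload[:n_embedded])  # True
```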

1egor.timatkov
Yes, so indexing all generations is absolutely a viable strategy, though like you said, it might be more expensive. Watermarking by choosing different tokens at a specific temperature might not be as effective (as you touched on), because in order to reverse that, you need the exact input. Even a slight change to the input or the context will shift the probability distribution over the tokens, after all. Which means you can't know if the LLM chose the first or second or third most probable token just by looking at the token. That being said, something like this can still be used to watermark text: If the LLM has some objective, text-independent criteria for being watermarked (like the "e" before "a" example, or perhaps something more elaborate created using gradient descent), then you can use an LLM's loss function to choose some middle ground between maximizing your independent criteria and minimizing the loss function. The ideal watermark would put markings into the meaning behind the text, not just the words themselves. No idea how that would happen, but in that way you could watermark an idea, and at that point hacks like "translate to French and back" won't work. Although watermarking the meaning behind text is currently, as far as I know, science fiction.

Super interesting work!

I like the idea of seeing if there are any features from the base model which are dead in the instruction-tuned-and-fine-tuned model as a proxy for "are there any features which fine-tuning causes the fine-tuned model to become unable to recognize". Another related question also strikes me as interesting, which is whether an SAE trained on the instruction-tuned model has any features which are dead in the base model - these might represent new features that the instruction-tuned model learned, which in turn might give some insight in... (read more)

5Taras Kutsyk
Agreed, but I think our current setup is too limited to capture this. If we’re using the same “base SAE” for both the base and finetuned models, a situation like the one you described really implies “now this feature from the base model has a different vector (direction) in the activation space OR this feature is no longer recognizable”. Without training another SAE on the finetuned model, we have no way to tell the first case from the 2nd one (at least I don’t see it).

This is indeed perhaps even more interesting, and I think the answer depends on how you map the features of the SAE trained on the instruction-tuned model to the features in the base model. If you do it naively by taking the feature (encoder) vector from the SAE trained on the instruction-tuned model (like it’s done in the post) and use it in the base model to check the feature’s activations, then once again you have a semi-decidable problem: either you get a high activation similarity (i.e. the feature has roughly the same activation pattern) indicating that it is present in the base model, OR you get something completely different: zero activation, ultralow density activation etc. And my intuition is that even the “zero activation” case doesn’t imply that the feature “wasn’t there” in the base model, perhaps it just had a different direction! So I think it’s very hard to make rigorous statements of which features “become dead” or “are born” by the finetuning process using only a single SAE.

I imagine it would be possible to do using two different SAEs trained on the base and finetuned models separately, and then studying which features they learned are "the same features" using Anthropic’s feature universality techniques (like the activation similarity or the logits similarity that we used). This would avoid uncertainties I mentioned by using model-independent feature representations like activation vectors. For example, if you find a feature X in the finetuned model (from the SAE trained on the fi

None of it would look like it was precipitated by an AI taking over.

But, to be clear, in this scenario it would in fact be precipitated by an AI taking over? Because otherwise it's an answer to "humans went extinct, and also AI took over, how did it happen?" or "AI failed to prevent human extinction, how did it happen?"

4Shmi
Any of those. Could be some kind of intentionality ascribed to AI, could be accidental, could be something else.
  1. It uses whatever means to eliminate other countries' nuclear arsenals

I think the part where other nations just roll with this is underexplained.

1RussellThor
Yes for sure. I don't know how it would play out, and am skeptical anyone could. We can guess scenarios.

  1. The most easily imagined one is the Pebbles owner staying in their comfort zone and not enforcing #2 at all. Something similar already happened - the USA got nukes first and let others catch up. In this case threatened nations try all sorts of things, political, commercial/trade, space war, arms race, but don't actually start a hot conflict. The Pebbles owner is left not knowing whether their system is still effective, nor are the threatened countries - an unstable situation.
  2. The threatened nation tries to destroy the pebbles with non-nuke means. If this was Russia, the USA maybe could regenerate the system faster than Russia could destroy satellites. If it's China, then let's say it's not. The USA then needs to decide whether to strike the anti-satellite ground infrastructure to keep its system...
  3. The threatened nation, such as NK, just refuses to give up nukes - in this case I can see the USA destroying it.
  4. India or Israel, say, refuses to give up their arsenal - I have no idea what would happen then.
Answer by faul_sname4112

Looking at the eSentire / Cybersecurity Ventures 2022 Cybercrime Report that appears to be the source of the numbers Google is using, I see the following claims:

  • $8T in estimated unspecified cybercrime costs ("The global annual cost of cybercrime is predicted to reach $8 trillion annually in 2023" with no citation of who is doing the estimation)
  • $20B in ransomware attacks in 2021 (source: a Cybersecurity Ventures report)
  • $30B in "Cryptocrime" which is e.g. cryptocurrency scams / rug pulls (source: another Cybersecurity Ventures report)

It appears to me t... (read more)

[faul_sname] Humans do seem to have strong preferences over immediate actions.

[Jeremy Gillen] I know, sometimes they do. But if they always did then they would be pretty useless. Habits are another example. Robust-to-new-obstacles behavior tends to be driven by future goals.

Point of clarification: the type of strong preferences I'm referring to are more deontological-injunction shaped than they are habit-shaped. I expect that a preference not to exhibit the behavior of murdering people would not meaningfully hinder someone whose goal was to ge... (read more)

1Jeremy Gillen
Yeah I'm on board with deontological-injunction shaped constraints. See here for example. Nah I still disagree. I think part of why I'm interpreting the words differently is because I've seen them used in a bunch of places e.g. the lightcone handbook to describe the lightcone team. And to describe the culture of some startups (in a positively valenced way). Being willing to be creative and unconventional -- sure, but this is just part of being capable and solving previously unsolved problems. But disregarding conventions that are important for cooperation that you need to achieve your goals? That's ridiculous.  Being willing to impose costs on uninvolved parties can't be what is implied by 'going hard' because that depends on the goals. An agent that cares a lot about uninvolved parties can still go hard at achieving its goals. Unfortunately we are not. I appreciate the effort you put into writing that out, but that is the pattern that I understood you were talking about, I just didn't have time to write out why I disagreed. This is the main point where I disagree. The reason I don't buy the extrapolation is that there are some (imo fairly obvious) differences between current tech and human-level researcher intelligence, and those differences appear like they should strongly interfere with naive extrapolation from current tech. Tbh I thought things like o1 or alphaproof might cause the people who naively extrapolate from LLMs to notice some of these, because I thought they were simply overanchoring on current SoTA, and since the SoTA has changed I thought they would update fast. But it doesn't seem to have happened much yet. I am a little confused by this. I didn't say likely, it's more an example of an issue that comes up so far when I try to design ways to solve other problems. Maybe see here for instabilities in trained systems, or here for more about that particular problem. I'm going to drop out of this conversation now, but it's been good, thanks! I thi

Lo and behold, this poor choice of ontology doesn’t work very well; the modeler requires a huge amount of complexity to decently represent the real-world system in their poorly-chosen ontology. For instance, maybe they need a ridiculously large decision tree or random forest to represent a neural net to decent precision.

That can happen because your choice of ontology was bad, but it can also be the case that representing the real-world system with "decent" precision in any ontology requires a ridiculously large model. Concretely, I expect that this is t... (read more)

Any agent which thinks it is at risk of being seen as cooperate-bot and thus fine to defect against in the future will be more wary of trusting that ASI.

[habryka] The way humans think about the question of "preferences for weak agents" and "kindness" feels like the kind of thing that will come apart under extreme optimization, in a similar way to how I expect the idea of "having a continuous stream of consciousness with a good past and good future is important" to come apart as humans can make copies of themselves and change their memories, and instantiate slightly changed versions of themselves, etc.

I think there will be options that are good under most of the things that "preferences for weak agents" ... (read more)

I like this exchange and the clarifications on both sides.

Yeah, it feels like it's getting at a crux between the "backchaining / coherence theorems / solve-for-the-equilibrium / law thinking" cluster of world models and the "OODA loop / shard theory / interpolate and extrapolate / toolbox thinking" cluster of world models.

You're right that coherence arguments work by assuming a goal is about the future. But preferences over a single future timeslice are too specific; the arguments still work if it's multiple timeslices, or an integral over time, or lar

... (read more)
2Jeremy Gillen
I know, sometimes they do. But if they always did then they would be pretty useless. Habits are another example. Robust-to-new-obstacles behavior tends to be driven by future goals. Yeah same. Although legible commitments or decision theory can serve the same purpose better, it's probably harder to evolve because it depends on higher intelligence to be useful. The level of transparency of agents to each other and to us seems to be an important factor. Also there's some equilibrium, e.g. in an overly honest society it pays to be a bit more dishonest, etc. It does unfortunately seem easy and useful to learn rules like honest-to-tribe or honest-to-people-who-can-tell or honest-unless-it's-really-important or honest-unless-I-can-definitely-get-away-with-it. I think if you remove "at any cost", it's a more reasonable translation of "going hard". It's just attempting to achieve a long-term goal that is hard to achieve. I'm not sure what "at any cost" adds to it, but I keep on seeing people add it, or add monomaniacally, or ruthlessly. I think all of these are importing an intuition that shouldn't be there. "Going hard" doesn't mean throwing out your morality, or sacrificing things you don't want to sacrifice. It doesn't mean being selfish or unprincipled such that people don't cooperate with you. That would defeat the whole point. Yes! No! Yeah mostly true probably. I'm talking about stability properties like "doesn't accidentally radically change the definition of its goals when updating its world-model by making observations". I agree properties like this don't seem to be on the fastest path to build AGI.