Kenoubi - LessWrong

I have very mixed feelings about this comment. It was a good story (just read it, and wouldn't have done so without this comment) but I really don't see what it has to do with this LW post.

The Most Forbidden Technique

Kenoubi12d10

Possible edge case / future work - what if you optimize for faithfulness and legibility of the chain of thought? The paper tests optimizing for innocent-looking CoT, but if the model is going to hack the test either way, I'd want it to say so! And if we have both an "is actually a hack" detector and a "CoT looks like planning a hack" detector, this seems doable.

Is this an instance of the Most Forbidden Technique? I'm not sure. I definitely wouldn't trust it to align a currently unaligned superintelligence. But it seems like maybe it would let you make an aligned model at a given capability level into a still aligned model with more legible CoT, without too much of a tax, as long as the model doesn't really need illegible CoT to do the task? And if capabilities collapse, that seems like strong evidence that illegible CoT was required for task performance; halt and catch fire, if legible CoT was a necessary part of your safety case.

Monthly Roundup #28: March 2025

Kenoubi12d10

Is it really plausible that human driver inattention just doesn't matter here? Sleepiness, drug use, personal issues, eyes were on something interesting rather than the road, etc. I'd guess something like that is involved in a majority of collisions, and that Just Shouldn't Happen to AI drivers.

Of course AI drivers do plausibly have new failure modes, like maybe the sensors fail sometimes (maybe more often than human eyes just suddenly stop working). But there should be plenty of data about that sort of thing from just testing them a lot.

The only realistic way I can see for AI drivers, that have been declared street-legal and are functioning in a roadway and regulatory system that humans (chose to) set up, to be less safe than human drivers is if there's some kind of coordinated failure. Like if they trust data coming from GPS satellites or cell towers, and those start spitting out garbage and throw the AIs off distribution; or a deliberate cyber-attack / sabotage of some kind.

Fact Finding: Trying to Mechanistically Understanding Early MLPs (Post 3)

Kenoubi22d10

“ goal” in “football| goal|keeping”

Looks like an anti-football (*American* football, that is) thing, to me. American football doesn't have goals, and soccer (which is known as "football" in most of the world) does. And you mentioned earlier that the baseball neuron is also anti-football.

Literature Review of Text AutoEncoders

Kenoubi25d10

Since it was kind of a pain to run, sharing these probably minimally interesting results. I tried encoding this paragraph from my comment:

I wonder how much information there is in those 1024-dimensional embedding vectors. I know you can jam an unlimited amount of data into infinite-precision floating point numbers, but I bet if you add Gaussian noise to them they still decode fine, and the magnitude of noise you can add before performance degrades would allow you to compute how many effective bits there are. (Actually, do people use this technique on latents in general? I'm sure either they do or they have something even better; I'm not a supergenius and this is a hobby for me, not a profession.) Then you could compare to existing estimates of text entropy, and depending on exactly how the embedding vectors are computed (they say 512 tokens of context but I haven't looked at the details enough to know if there's a natural way to encode more tokens than that; I remember some references to mean pooling, which would seem to extend to longer text just fine?), compare these across different texts.

with SONAR, breaking it up like this:

sentences = [
    'I wonder how much information there is in those 1024-dimensional embedding vectors.',
    'I know you can jam an unlimited amount of data into infinite-precision floating point numbers, but I bet if you add Gaussian noise to them they still decode fine, and the magnitude of noise you can add before performance degrades would allow you to compute how many effective bits there are.',
    '(Actually, do people use this technique on latents in general? I\'m sure either they do or they have something even better; I\'m not a supergenius and this is a hobby for me, not a profession.)',
    'Then you could compare to existing estimates of text entropy, and depending on exactly how the embedding vectors are computed (they say 512 tokens of context but I haven\'t looked at the details enough to know if there\'s a natural way to encode more tokens than that;',
    'I remember some references to mean pooling, which would seem to extend to longer text just fine?), compare these across different texts.']

and after decode, I got this:

['I wonder how much information there is in those 1024-dimensional embedding vectors.',
 'I know you can encode an infinite amount of data into infinitely precise floating-point numbers, but I bet if you add Gaussian noise to them they still decode accurately, and the amount of noise you can add before the performance declines would allow you to calculate how many effective bits there are.',
 "(Really, do people use this technique on latent in general? I'm sure they do or they have something even better; I'm not a supergenius and this is a hobby for me, not a profession.)",
 "And then you could compare to existing estimates of text entropy, and depending on exactly how the embedding vectors are calculated (they say 512 tokens of context but I haven't looked into the details enough to know if there's a natural way to encode more tokens than that;",
 'I remember some references to mean pooling, which would seem to extend to longer text just fine?), compare these across different texts.']

Can we do semantic arithmetic here?

sentences = [
     'A king is a male monarch.',
     'A bachelor is an unmarried man.',
     'A queen is a female monarch.',
     'A bachelorette is an unmarried woman.'
]
...
pp(reconstructed)
['A king is a male monarch.',
 'A bachelor is an unmarried man.',
 'A queen is a female monarch.',
 'A bachelorette is an unmarried woman.']
...
new_embeddings[0] = embeddings[0] + embeddings[3] - embeddings[1]
new_embeddings[1] = embeddings[0] + embeddings[3] - embeddings[2]
new_embeddings[2] = embeddings[1] + embeddings[2] - embeddings[0]
new_embeddings[3] = embeddings[1] + embeddings[2] - embeddings[3]

reconstructed = vec2text_model.predict(new_embeddings, target_lang="eng_Latn", max_seq_len=512)
pp(reconstructed)
['A kingwoman is a male monarch.',
 "A bachelor's is a unmarried man.",
 'A bachelorette is an unmarried woman.',
 'A queen is a male monarch.']

Nope. Interesting though. Actually I guess the 3rd one worked?

OK, I'll stop here, otherwise I'm at risk of going on forever. But this seems like a really cool playground.

ParaScopes: Do Language Models Plan the Upcoming Paragraph?

Kenoubi1mo50

You appear to have two full copies of the entire post here, one above the other. I wouldn't care (it's pretty easy to recognize this and skip the second copy) except that it totally breaks the way LW does comments on and reactions to specific parts of the text; one has to select a unique text fragment to use those, and with two copies of the entire post, there aren't any unique fragments.

Literature Review of Text AutoEncoders

Kenoubi1mo80

Wow, the SONAR encode-decode performance is shockingly good, and I read the paper and they explicitly stated that their goal was translation, and that the autoencoder objective alone was extremely easy! (But it hurt translation performance, presumably by using a lot of the latent space to encode non-semantic linguistic details, so they heavily downweighted autoencoder loss relative to other objectives when training the final model.)

I wonder how much information there is in those 1024-dimensional embedding vectors. I know you can jam an unlimited amount of data into infinite-precision floating point numbers, but I bet if you add Gaussian noise to them they still decode fine, and the magnitude of noise you can add before performance degrades would allow you to compute how many effective bits there are. (Actually, do people use this technique on latents in general? I'm sure either they do or they have something even better; I'm not a supergenius and this is a hobby for me, not a profession.) Then you could compare to existing estimates of text entropy, and depending on exactly how the embedding vectors are computed (they say 512 tokens of context but I haven't looked at the details enough to know if there's a natural way to encode more tokens than that; I remember some references to mean pooling, which would seem to extend to longer text just fine?), compare these across different texts.

Exploring this embedding space seems super interesting, in general, way more so on an abstract level (obviously it isn't as directly useful at this point) than the embedding space used by actual LLMs. Like, with only 1024 dimensions for a whole paragraph, it must be massively polysemantic, right? I guess your follow-on post (which this was just research to support) is implicitly doing part of this, but I think maybe it underplays "can we extract semantic information from this 1024-dimensional embedding vector in any way substantially more efficient than actually decoding it and reading the output?" (Or maybe it doesn't; I read the other post too, but haven't re-read it in light of this one.)

There also appears to be a way to attempt to use this to enhance model capabilities. I seem to think of one of these every other week, and again, I'm not a supergenius nor a professional ML researcher so I assume it's obvious to those in the field. The devil appears to be in the details; sometimes a new innovation appears to be a variant of something I thought of years ago, sometimes they come out of left field from my perspective, and in no case does there appear to be anything I, from my position, could have usefully done with the idea, so far. Experiments seem very compute-limited, especially because like all other software development in my experience, one needs to actually run the code and see what happens. This particular technique, if it actually works (I'm guessing either it doesn't, or it only works when scaled so large that a bunch of other techniques would have worked just as well and converged on the same implicit computations) might come with large improvements to interpretability and controllability, or it might not (which seems to be true for all the other ideas I have that might improve capabilities, too). I'm not advising anyone to try it (again, if one works in the field I think it's obvious, so either there are reasons not to or someone already is). Just venting, I guess. If anyone's actually reading this, do you think there's anything useful to do with this idea and others like it, or are they pretty much a dime a dozen, interesting to me but worthless in practice?

(Sorry for going on so long! Wish I had a way to pay a penny to anyone who thoughtfully reads this, whether or not they do anything with it.)

Requirements for a Basin of Attraction to Alignment

Kenoubi5mo10

Sorry, I think it's entirely possible that this is just me not knowing or understanding some of the background material, but where exactly does this diverge from justifying the AI pursuing a goal of maximizing the inclusive genetic fitness of its creators? Which clearly either isn't what humans actually want (there are things humans can do to make themselves have more descendants that no humans, including the specific ones who could take those actions, want to take, because of godshatter) or is just circular (who knows what will maximize inclusive genetic fitness in an environment that is being created, in large part, by the decision of how to promote inclusive genetic fitness?). At some point, your writing started talking about "design goals", but I don't understand why tools / artifacts constructed by evolved creatures, that happen to increase the inclusive genetic fitness of the evolved creatures who constructed them by means other than the design goals of those who constructed them, wouldn't be favored by evolution, and thus part of the "purpose" of the evolved creatures in constructing them; and this doesn't seem like an "error" even in the limit of optimal pursuit of inclusive genetic fitness, this seems to be just what optimal pursuit of IGF would actually do. In other words, I don't want a very powerful human-constructed optimizer to pursue the maximization of human IGF, and I think hardly any other humans do either; but I don't understand in detail why your argument doesn't justify AI pursuit of maximizing human IGF, to the detriment of what humans actually value.

Are the other Rationality: A-Z sequences coming out as books?

Answer by KenoubiOct 01, 202410

As the person who requested of MIRI to release the Sequences as paper books in the first place, I have asked MIRI to release the rest of them, and credibly promised to donate thousands of dollars if they did so. Given the current situation vis-a-vis AI, I'm not that surprised that it still does not appear to be a priority to them, although I am disappointed.

MIRI, if you see this, yet another vote for finishing the series! And my offer still stands!

"Wanting" and "liking"

Kenoubi6mo10

Thank you for writing this. It has a lot of stuff I haven't seen before (I'm only really interested in neurology insofar as it's the substrate for literally everything I care about, but that's still plenty for "I'd rather have a clue than treat the whole area as spooky stuff that goes bump in the night").

As I understand it, you and many scientists are treating energy consumption by anatomical part of the brain (as proxied by blood flow) as the main way to see "what the brain is doing". It seems possible to me that there are other ways that specific thoughts could be kept compartmentalized, e.g. which neurotransmitters are active (although I guess this correlates pretty strongly to brain region anyway) or microtemporal properties of neural pulses; but the fact that we've found any kind of reasonably consistent relationship between [brain region consuming energy] and [mental state as reported or as predicted by the situation] means that brain region is a factor used for separating / modularizing cognition, if not that it's the only such part. So, I'll take brain region = mental module for granted for now and get to my actual question:

Do you know whether anyone has compiled data, across a wide variety of experiments or other data-gathering opportunities, of which brain regions have which kinds of correlations with one another? E.g. "these two tend to be active simultaneously", "this one tends to become active just after this one", etc.

I'm particularly interested in this for the brain regions you mention in this article, those related in various senses to good and/or to bad. If one puts both menthol and capsaicin in one's mouth at the same time, the menthol will stimulate cold receptors and the capsaicin will stimulate heat receptors, and one will have an experience out of range of what the sensors usually encounter: hot and cold, simultaneously in the same location. What I actually want to know is: are good and bad (or some forms of them, anyway) also represented in a way where one isn't actually the opposite of the other, neurologically speaking? If so, are there actual cases that are clearly best described as "good and bad", where to pick a single number instead would inevitably miss the intensity of the experience?

LESSWRONG
LW

Posts

Wikitag Contributions

Comments