
Jacob G-W

I really like learning new things!

https://jacobgw.com/

Comments

Mateusz Bagiński's Shortform
Jacob G-W · 2mo

I have had something like this for years: https://www.lesswrong.com/posts/j8WMRgKSCxqxxKMnj/what-i-think-about-when-i-think-about-history

Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format
Jacob G-W · 3mo

Better, thanks!

Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format
Jacob G-W · 3mo

I think the way you use "utility monster" is not how the term is normally used. It normally means an agent that "receives much more utility from each unit of a resource that it consumes than anyone else does" (https://en.wikipedia.org/wiki/Utility_monster).

nikola's Shortform
Jacob G-W · 4mo

I don't think this applies just to AGI and school; it applies more generally in lots of situations. If you have something better to do, do it; otherwise, keep doing what you are doing. Dropping out without something better to do just seems like a bad idea. I like this blog post: https://colah.github.io/posts/2020-05-University/

How AI Takeover Might Happen in 2 Years
Jacob G-W · 4mo

Thanks for the reply! I thought that you were saying the reward seeking was likely to be terminal. This makes a lot more sense.

How AI Takeover Might Happen in 2 Years
Jacob G-W · 4mo

Thanks for the replies! I do want to clarify the distinction between specification gaming and reward seeking (seeking reward because it is reward, and that is somehow good). For example, I think the desire to edit the RAM of machines that calculate reward in order to increase it (or some other desire to just increase the literal number) is pretty unlikely to emerge, but non-reward-seeking types of specification gaming, like making inaccurate plots that look nicer to humans or being sycophantic, are more likely. The phrase "reward seeking" is probably a bad one for this distinction (I don't know a better way to phrase it), but I hope I've conveyed what I mean. I agree that this distinction is probably not a crux.

> What I mean is that the propensities of AI agents change over time -- much like how human goals change over time.

I understand your model here better now, thanks! I don't have enough evidence about how long-term agentic AIs will work to evaluate how likely this is.

shortplav
Jacob G-W · 4mo

See also this: https://cavendishlabs.org/blog/negative-temperature/

How AI Takeover Might Happen in 2 Years
Jacob G-W · 5mo

> In the iron jaws of gradient descent, its mind first twisted into a shape that sought reward.

I'm a bit confused about this sentence. I don't understand why gradient descent would train something that would "seek" reward. The way I understand gradient-based RL approaches is that they reinforce actions that led to high reward. So I guess if the AI was thinking about getting high reward for some reason (maybe because there was a lot about AIs seeking reward in the pretraining data) and then actually got high reward after that, the thought would be reinforced, and it could end up as a "reward seeker." But this seems quite path-dependent, and I haven't seen much evidence for the path of "model desires reward first and then this gets reinforced" actually happening (I'd be very interested if there were an example, though!). I read Sycophancy to subterfuge: Investigating reward tampering in language models and saw that even after training in environments that blatantly admit reward tampering, the absolute rate of reward tampering is still quite small (and lower still, given that some instances seem benign). I suspect that it might be higher for more capable models (like 3.5 Sonnet), but I'd still predict it would be quite small, and I think it's very unlikely for this to arise in more natural environments.
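(To be concrete about what I mean by "reinforce actions that led to high reward": in a REINFORCE-style policy-gradient update, sketched below, the update just upweights whatever actions were sampled before high reward; nothing in it requires the model to represent or want the reward. This is an illustrative sketch, not code from any particular RL setup.)

```python
import torch

def reinforce_update(optimizer, log_probs, reward, baseline=0.0):
    # log_probs: list of log pi(a_t | s_t) tensors from one sampled trajectory.
    # The update upweights whatever actions were actually sampled, scaled by
    # (reward - baseline); nothing here depends on the policy representing
    # or "wanting" the reward signal.
    loss = -(reward - baseline) * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```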

I think specification gaming will probably be a problem, but it's pretty unlikely to be of the kind where the model tries to "seek" reward by thinking about its reward function, especially if the model is not in an environment that allows it to change its reward function. I've made a fatebook and I encourage others to predict on this. I predicted 5% here, but I think the chance of it happening and then being suppressed in a way that turns into deception is lower, probably around 0.5% (that one would be much harder to resolve).

 

I also don't really understand what "And then, in the black rivers of its cognition, this shape morphed into something unrecognizable." means. An elaboration would be appreciated.

 

I think that the beginning and the rest of the piece (given that the model somehow ends up as a very weird agent with weird goals) are quite plausible, so thanks for writing it!

Building intuition with spaced repetition systems
Jacob G-W · 5mo

> So far, I have trouble because it lacks some form of spatial structure, and the algorithm feels too random to build meaningful connections between different cards.

Hmm, I think that after doing a lot of Anki, my brain kind of formed its own spatial structure, but I don't think this happens to everyone.

I just use the basic card type for math with some LaTeX.
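A made-up card in that format (illustrative content, not one of my actual cards; the \( \) delimiters are what Anki's built-in MathJax rendering expects) might look something like this:

```python
# Made-up example of a basic (front/back) card, not one of my actual cards.
# Anki's built-in MathJax rendering expects \( \) around inline LaTeX.
card = {
    "Front": r"State the definition of a group homomorphism \(\phi: G \to H\).",
    "Back":  r"\(\phi(ab) = \phi(a)\,\phi(b)\) for all \(a, b \in G\).",
}
```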

I find that doing fancy card types is kind of like premature optimization. Doing the reviews is the most important part. On the other hand, it's really important that the cards themselves are written well. This essay contains my most refined views on card creation. Some other nice ones are the 20 rules of knowledge formulation and How to write good prompts: using spaced repetition to create understanding. Hope this answer helped!

Creating Interpretable Latent Spaces with Gradient Routing
Jacob G-W · 7mo

I disagree that this is the same as just stitching together different autoencoders. Presumably the encoder has some shared computation before specializing at the encoding level. I also don't see how you could use 10 different autoencoders to classify an image from the encodings. I guess you could just look at the reconstruction losses, and the autoencoder with the lowest loss would probably correspond to the correct label, but that seems different from what I'm doing. However, I agree that this application is not useful. I shared it because I (and others) thought it was cool. It's not really practical at all. Hope this addresses your question :)
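(For concreteness, the baseline I'm contrasting with would look something like the sketch below: one separately trained autoencoder per class, predicting whichever one reconstructs the input best. The `autoencoders` list is hypothetical; this is purely illustrative and not what gradient routing does.)

```python
import torch
import torch.nn.functional as F

def classify_by_reconstruction(x, autoencoders):
    # Hypothetical baseline: one autoencoder trained per class (10 in this case).
    # Predict the label whose autoencoder reconstructs x with the lowest MSE.
    losses = torch.stack([F.mse_loss(ae(x), x) for ae in autoencoders])
    return int(torch.argmin(losses))
```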

Wikitag Contributions

False Dilemma · 2y · (+42)

Posts

Distillation Robustifies Unlearning (Ω) · 226 karma · 17d · 36 comments
Selective modularity: a research agenda (Ω) · 66 karma · 3mo · 2 comments
Creating Interpretable Latent Spaces with Gradient Routing · 26 karma · 7mo · 6 comments
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks (Ω) · 165 karma · 7mo · 12 comments
I found >800 orthogonal "write code" steering vectors · 104 karma · 1y · 19 comments
Building intuition with spaced repetition systems · 55 karma · 1y · 8 comments
What I learned from doing Quiz Bowl · 10 karma · 1y · 0 comments
Taking into account preferences of past selves · 14 karma · 1y · 9 comments
From the outside, American schooling is weird · 23 karma · 1y · 4 comments
XAI releases Grok base model · 11 karma · 1y · 3 comments
g-w1's Shortform · 4 karma · 1y · 15 comments