User Comment Replies

Shallow review of live agendas in alignment & safety

So, an update on this: I got busy with applications this last week and forgot to mail them about this, but I just got a mail from Lightspeed saying I'm going to get a grant because Jaan Tallinn has increased the amount he is distributing through Lightspeed Grants (though they say that "We have not yet received the money, so delays of over a month or even changes in amount seem quite possible").
(Edit: for the record I did end up getting funded).

Shallow review of live agendas in alignment & safety

Victor Levoso1y30

Yeah, Stag told me that's where they saw it. But I'm confused about what that means.
I certainly didn't get money from Lightspeed; I applied but got a mail saying I wouldn't get a grant.
I still have to read up on what that is, but it says "recommendations", so it might not necessarily mean those people got money or something?
I might have to just mail them to ask, I guess, unless it becomes clear what's up with that after reading their FAQ more deeply about what this S-process is.

3technicalities1y

The story I heard is that Lightspeed are using SFF's software and SFF jumped the gun in posting them and Lightspeed are still catching up. Definitely email.

Shallow review of live agendas in alignment & safety

Victor Levoso1y50

Just wanted to point out that my algorithm distillation thing didn't actually get funded by Lightspeed, and I have in fact received no grant so far (while the post says I have 68k for some reason? It might be getting mixed up with someone else).
I'm also currently working on another interpretability project with other people that will likely be published relatively soon.
But my resources continue being $0 and I haven't managed to get any grant yet.

3technicalities1y

Interesting. I hope I am the bearer of good news then

Creating a Discord server for Mechanistic Interpretability Projects

Victor Levoso2y10

Yeah, basically this. The Eleuther Discord is nice, but it's also not intended for small mechanistic interpretability projects like this, or at least they weren't happening there.
Plus Eleuther is also "not the place to ask for technical support or beginner questions", while for this server I think it would be nice if it becomes a place where people share learning resources and get advice and ideas for their MI projects and that kind of thing.
Not sure how well it's going to work out, but if at least some projects end up happening that wouldn't have happened otherwise, I'll consider it a success even if the server dies out and they go on to work on their own private Slacks or Discords or whatever.

Creating a Discord server for Mechanistic Interpretability Projects

Victor Levoso2y40

Exactly, it's already linked on the project ideas channel of the Discord server.
Part of the reason I wanted to do this is that it seems to me that there are a lot of things on that list that people could be working on, and apparently there are a lot of people who want to work on MI, going by the number of people that applied to the Understanding Search in Transformers project in AI Safety Camp. What's missing is some way of taking those people and getting them to actually work on those projects.

Inverse Scaling Prize: Second Round Winners

Victor Levoso2y10

I think it's also not obvious how it solves the problem: whether it's about the model only being capable of doing the required reasoning using multiple steps (though why the inverse scaling then?) or something more like writing an explanation makes the model more likely to use the right kind of reasoning.

And inside of that second option there are a lot of ways that could work internally, whether it's about distributions of kinds of humans it predicts, or something more like different circuits being activated in different contexts in a way that doesn't have to ...

Inverse Scaling Prize: Second Round Winners

Victor Levoso2y*30

I want to note that a lot of the behaviors found in the Inverse Scaling Prize do in fact disappear by just adding "Let's think step by step". I already tested this a bit a few months ago in Apart Research's hackathon along with a few other people (https://itch.io/jam/llm-hackathon/rate/1728566) and might try to do it more rigorously for all the entries now that all of the winners have been announced (plus I was procrastinating on it and this is a good point to actually get around to doing it).
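As a rough illustration of the kind of prompt change described above, here is a minimal sketch that pairs each task prompt with a variant ending in the step-by-step cue. The function names, the placement of the cue on its own line after the prompt, and the example prompt are all assumptions made for illustration; this is not the code used in the hackathon entry, and it does not call any model.

```python
from typing import List, Tuple

# The cue quoted in the comment above; where exactly it should go in the
# prompt is an assumption of this sketch.
COT_SUFFIX = "Let's think step by step."


def with_step_by_step(prompt: str, suffix: str = COT_SUFFIX) -> str:
    """Return the prompt with the step-by-step cue appended on a new line."""
    return prompt.rstrip() + "\n" + suffix


def build_prompt_pairs(prompts: List[str]) -> List[Tuple[str, str]]:
    """Pair each original prompt with its step-by-step variant."""
    return [(p, with_step_by_step(p)) for p in prompts]


if __name__ == "__main__":
    # Hypothetical placeholder prompt, not taken from any prize entry.
    example = "Q: (task question goes here)\nA:"
    for original, modified in build_prompt_pairs([example]):
        print(modified)
```

Each pair could then be sent to the same model and scored with the task's own metric, so that the original and modified prompts are compared on identical examples.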

Also, another thing to note is that ChatGPT shows the same be...

5gwern2y

I was wondering whether to comment on how to take a demonstration of inner-monologue or alternate prompting approaches solving the problems... There's definitely a bunch of different ways you can interpret that outcome. After all, even if you solve it with a better prompt, the fact remains that they demonstrated inverse scaling on the original prompt. So what does that mean? I guess that depends on what you thought inverse scaling was. One way is to take the inverse-scaling as a sub-category of hidden scaling: it 'really' was scaling, and your 'bad prompts' just masked the hidden scaling; it had the capability and 'sampling can show the presence of knowledge but not the absence', and the Contest has been useful primarily in experimentally demonstrating that skilled ML professionals can be hoodwinked into severely underestimating the capabilities of powerful DL models, which has obvious AI safety implications.

200 COP in MI: Interpreting Reinforcement Learning

Victor Levoso2y*61

Another idea that could be interesting for decision transformers is figuring out what is going on in this paper https://arxiv.org/pdf/2201.12122.pdf
Also, I can confirm that at least on the Hopper environment, training a 1L DT works: https://api.wandb.ai/report/victorlf4/jvuntp8l

Maybe it does for the bigger environments too; I haven't tried yet.
Here's the fork I made of the decision transformer code to save models, in case someone else wants to do it, to save them some work: https://github.com/victorlf4/decision-transformer-interpretability
(I used the original co...


All of Victor Levoso's Comments + Replies