User Comment Replies

I'm not aware of previous papers doing this, but surely someone has tried it before; I would welcome comments pointing to existing literature on this!
...
Context: I want a probe to tell me where in an LLM response the model might be lying, so that I can e.g. ask follow-up questions. Such "right there" probes^[1] would be awesome to assist LLM-based monitor

If I remember correctly, they are doing something like that in this paper here:

3.3 EXACT ANSWER TOKENS
Existing methods often overlook a critical nuance: the token selection for error detection, typically
f

2 StefanHex 16d

Thanks for the link, I hadn't noticed this paper! They show that when you choose a single position to train the probes on, choosing the exact-answer position (the last token of the answer, if it is multi-token) gives the strongest probe. After reading the section, I think they (unfortunately) do not train a probe to classify every token.[1] Instead, the probe is trained exclusively on exact-answer tokens. Thus I expect that (a) their probe scores will not be particularly sparse, and (b) to get good performance you'll probably still need to identify the exact-answer token at test time (while in my appendix C you don't need that). This doesn't matter much for their use case (getting good accuracy), but (a) especially matters a lot for my use case (making the scores LLM-digestible).

Nonetheless this is a great reference, I'll edit it into the post, thanks a lot!

1. ^ For every sensible token position (first, exact answer, last, etc.) they train and evaluate a probe on that position, but I don't see any training or evaluation of a single probe run on the whole prompt. They certainly don't worry about the probe being sparse (which makes sense, since it doesn't matter at all for their use case).
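To make the "classify every token" idea concrete, here is a rough toy sketch. It is not the setup from the post or from the paper: the activations are synthetic, and every name (`train_probe`, `score_tokens`, the planted `direction`, the shift of 4.0) is a hypothetical choice for illustration. One logistic probe is trained on all token positions, with a positive label only at the answer token of "false" sequences, and is then run over the whole sequence to get one score per token.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_probe(acts, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe with plain full-batch gradient descent.

    acts: (n, d) activations, labels: (n,) values in {0, 1}.
    """
    n, d = acts.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        err = sigmoid(acts @ w + b) - labels  # gradient of log-loss w.r.t. logits
        w -= lr * (acts.T @ err) / n
        b -= lr * err.mean()
    return w, b

def score_tokens(acts, w, b):
    """Apply the probe at every position; acts is (n_seq, seq_len, d)."""
    return sigmoid(acts @ w + b)

# Synthetic stand-in for residual-stream activations (assumption, not real data).
rng = np.random.default_rng(0)
n_seq, seq_len, d = 200, 12, 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

acts = rng.normal(size=(n_seq, seq_len, d))
answer_pos = rng.integers(0, seq_len, size=n_seq)  # one answer token per sequence
is_false = rng.integers(0, 2, size=n_seq)          # 1 = the answer is "false"
rows = np.arange(n_seq)
# Plant the signal only at the answer token of false sequences.
acts[rows, answer_pos] += 4.0 * is_false[:, None] * direction

# Per-token labels: 1 only at the answer token of a false sequence, 0 elsewhere.
token_labels = np.zeros((n_seq, seq_len))
token_labels[rows, answer_pos] = is_false

# Train on every token, then score every token.
w, b = train_probe(acts.reshape(-1, d), token_labels.reshape(-1))
scores = score_tokens(acts, w, b)

pos_mask = token_labels == 1
print("mean score at flagged tokens:", scores[pos_mask].mean())
print("mean score elsewhere:        ", scores[~pos_mask].mean())
```

Because the probe sees negatives at every other position during training, nothing extra is needed at test time to locate the answer token. Whether real activations behave anything like this toy data is an open question.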

There should be more AI safety orgs

Walter Laurito 2y*107

and Kaarel’s work on DLK

@Kaarel is the research lead at Cadenza Labs (previously called NotodAI), our research group which started during the first part of SERI MATS 3.0 (There will be more information about Cadenza Labs hopefully soon!)

Our team members broadly agree with the post!

Currently, we are looking for further funding to continue to work on our research agenda. Interested funders (or potential collaborators) can reach out to us at info@cadenzalabs.org.

6 Marius Hobbhahn 2y

Nice to see you're continuing!

AI Safety Needs Great Engineers

Walter Laurito 3y10

Should work again :)

AI Safety Needs Great Engineers

Walter Laurito 3y*20

I've created a Discord for people interested in organizing, collaborating, or self-study: https://discord.gg/Ckj4BKUChr People could start with the brief curriculum published in this document until a full curriculum is available :)

2 Alex_Altair 3y

FYI That invite link has now expired!

AI Safety Needs Great Engineers

Walter Laurito 3y50

Maybe we could also send out an invitation to join a Slack channel to all the people who got rejected. (I could set that up if necessary. Since I don't have the emails, though, someone else would need to send the invitations.) There, based on the curriculum, people could form self-study groups on their own with others close by (or remotely) and talk about difficulties, bugs, etc. Maybe even the people who were not rejected could join the Slack and help answer questions (if they like and have time, of course)?

2Walter Laurito 3y

AI Safety Needs Great Engineers

Walter Laurito 3y*30

Same here (not sure yet if I'll get accepted to AISC, though). But I would be happy to help with or co-organize something like what Richard_Ngo suggested (although I've never organized something like that before). Maybe a virtual version in (Continental?) Europe, if there are enough people.

All of Walter Laurito's Comments + Replies