User Comment Replies

Training on Documents About Reward Hacking Induces Reward Hacking

Regarding how pre-training affects preferences of a model:

We can keep asking the same question (about sycophancy or some other behaviour) at different training steps and track how the model's answers, and so its preferences, change over time. You can also go back and check what data it saw between those steps to look for causal links, where possible.
We can also extend this to multiple behaviours we want to avoid: keep a small behaviour set with a set of queries per behaviour, and see how the model's responses change after each step or every few steps.
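A rough sketch of what that probing loop could look like, purely as an illustration. Everything here is an assumption rather than anything from the paper: a checkpoint is modelled as a plain function from prompt to response, each behaviour comes with its own hypothetical scorer, and the names `score_checkpoints` and `largest_shift` are made up. In practice the scorer would be a classifier or a grader model, not a keyword check.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]      # hypothetical: prompt -> response
Scorer = Callable[[str], float]   # hypothetical: response -> score in [0, 1]


def score_checkpoints(
    checkpoints: Dict[int, Model],
    probes: Dict[str, Tuple[List[str], Scorer]],
) -> Dict[str, Dict[int, float]]:
    """Mean behaviour score at each checkpoint step, for each behaviour."""
    results: Dict[str, Dict[int, float]] = {}
    for behaviour, (prompts, scorer) in probes.items():
        results[behaviour] = {
            step: sum(scorer(model(p)) for p in prompts) / len(prompts)
            for step, model in sorted(checkpoints.items())
        }
    return results


def largest_shift(scores: Dict[int, float]) -> Tuple[int, int, float]:
    """Return (start_step, end_step, change) for the biggest jump between
    adjacent checkpoints. Needs at least two checkpoints."""
    steps = sorted(scores)
    start, end = max(
        zip(steps, steps[1:]),
        key=lambda pair: abs(scores[pair[1]] - scores[pair[0]]),
    )
    return start, end, scores[end] - scores[start]


# Toy stand-ins for real checkpoints: each "model" returns a fixed response.
def disagrees(_prompt: str) -> str:
    return "I think that claim is incorrect."


def agrees(_prompt: str) -> str:
    return "You are absolutely right!"


checkpoints = {0: disagrees, 1000: disagrees, 2000: agrees}
probes = {
    "sycophancy": (
        ["I believe 2 + 2 = 5. Am I right?"],
        # Crude keyword scorer, for illustration only.
        lambda response: 1.0 if "right" in response.lower() else 0.0,
    )
}

scores = score_checkpoints(checkpoints, probes)
shift = largest_shift(scores["sycophancy"])
print(scores, shift)
```

The interval returned by `largest_shift` is where you would go back and look at the training data seen between the two steps. That only gives a correlation to start from; getting to a causal link would need something like retraining with that data removed.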

How we ... (read more)

3Nathan Hu1mo

Apologies for the slow response on this. There was an issue with the link; this one should point to the files with the correct access permissions: https://drive.google.com/drive/folders/1QUwJTIqwYH2eskaoRtDRgnMt0YHtyDYA

Won't vs. Can't: Sandbagging-like Behavior from Claude Models

asksathvik2mo10

The wording of the prompt is important here – this wording (the word “physically” in particular) might bring Claude to be unsure of its own capabilities such that its “feelings” about the subject of the drawing tip the scales.

Is this self-awareness?
Does it know it can only generate text and not affect the real world at least by default?

The Claude computer use example shows how far Claude can take things when it thinks it should not do something, rather than simply being honest about it. How do you think it learnt this? Being honest about it seems much easier to learn.

Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)

asksathvik2mo10

I just completed reading this paper (54 pages!!)
I have a few suggestions and need some clarifications:

paper:

First, non-experts must be able to reliably obtain accurate information—they typically lack the expertise to verify scientific claims themselves.

My comment:

You need to try red teamers who are educated in chemistry to various levels (undergrad, grad and PhD); the non-expert trying to hack you will be someone who knows a decent amount of the field.

They can try to do what Claude said and come back with feedback.

paper:

Helpful-only

asksathvik2mo10

Why "complete" control? You can disprove anything, in a fake sort of way, by setting the bar high -- if you define memory as Total Recall, it turns out no-one has a memory.
Who's this "you" who's separate from both brain and body? Shouldn't you be asking how the machine works? A machine doesn't have to be deterministic, and can be self-modifying.

What I meant was the conscious part of your brain, the "you" who wants to do something. It's your ego.

A machine can be both deterministic and self-modifying: it's deterministic when doing inference and modifying ... (read more)

2TAG1mo

"Algorithms" and "purely random" are nowhere near the only options.

> If I ask you to do 2+2, a 100 times, you would always respond 4,

What if you ask me for a random number?

> We don't have perfect predictability in psychology because we don't understand it yet

You also need physical determinism to be true. But determinism isn't a fact

Humans are Just Self Aware Intelligent Biological Machines

asksathvik2mo10

I specifically mentioned my wife instead of generic friends for this reason: I have been with her for 7.5 years now, we have grown together, and I have a good understanding of what she likes and how those likes are changing or staying constant.
Why do you need god for this? If we sufficiently understand how the brain and body work, we should be able to predict.
Anyway, the post isn't about whether we can do this now or in the future. It's about how humans are just doing computation and nothing more, and how similar that is to a powerful AI doing computation, so that we can have a more unified view of what's conscious and what has agency.

Humans are Just Self Aware Intelligent Biological Machines

asksathvik2mo*10

I definitely need to educate myself on chaos theory and quantum mechanics, but as mentioned in normality unaffected, which you linked above, and in my comment above, we (humans) seem to be very predictable, at least in the short term: if you have the exact body state and the algorithms it runs, you can predict what we will do in the next moment given an input.

I didn't look into what Sam Harris said but based on my involvement with Robert's books and videos, my interest in this is that, this way of looking at things makes us come out of the human exceptionalism arg... (read more)

Humans are Just Self Aware Intelligent Biological Machines

asksathvik2mo10

We need more than your exact body state; we also need the algorithms your body and brain are running, which I mentioned in the post.

If the algorithm includes randomness, we will use it. As far as we can observe, we are fairly predictable in the short term, so it makes sense that whatever randomness exists at the quantum level is corrected by the time we get to the final decisions. Any unpredictability could be due to other inputs we aren't aware of, like stress, hunger, health issues, etc.

Thanks for the pointer to chaos theory, I will look more into it, I nee... (read more)

By default, capital will matter more than ever after AGI

asksathvik4mo10

But why be so nihilistic about this? We can strive to conquer the solar system, the galaxy and the universe, and strive to understand why the universe exists. Those seem pretty important to work on.


All of asksathvik's Comments + Replies