Thane Ruthenis

There's a general-purpose trick I've found that should, in theory, be applicable in this context as well, although I haven't mastered that trick myself yet.

Essentially: when you find yourself in any given cognitive context, there's almost surely something "visible" from this context such that understanding/mastering/paying attention to that something would be valuable and interesting.

For example, suppose you're reading a boring, nonsensical continental-philosophy paper. You can:

  • Ignore the object-level claims and instead try to reverse-engineer what must go wrong in human cognition, in response to what stimuli, to arrive at ontologies that have so little to do with reality.
  • Start actively building/updating a model of the sociocultural dynamics that incentivize people to engage in this style of philosophy. What can you learn about mechanism design from that? It presumably sheds light on how to align people towards pursuing arbitrary goals, or how to prevent this from happening...
  • Pay attention to your own cognition. How exactly are you mapping the semantic content of the paper to an abstract model of what the author means, or to the sociocultural conditions that created this paper? How do these cognitive tricks generalize? If you find a particularly clever way to infer something from the text, check: would your cognitive policy automatically deploy this trick in all contexts where it'd be useful, or do you need to manually build a TAP for that?
  • Study what passages make the feelings of boredom or frustration spike. What does that tell you about how your intuitions/heuristics work? Could you extract any generalizable principles out of that? For example, if a given sentence particularly annoys you, perhaps it's because it features a particularly flawed logical structure, and it'd be valuable to learn to spot subtler instances of such logical flaws "in the wild".

The experience of reading the paper's text almost certainly provides some data uniquely relevant to some valuable questions, data you legitimately can't source any other way. (In the above examples: sure, you can learn more efficiently about the author's cognition or the sociocultural conditions by reading some biographies or field overviews. But (1) this wouldn't give you the meta-cognitive data about how you can improve your inference functions for mapping low-level data to high-level properties, and (2) those higher-level summaries would necessarily be lossy, and give you a more impoverished picture than what you'd get from boots-on-the-ground observations.)

Something similar applies to:

  • Listening to boring lectures. (For example, you can pay intense attention to the lecturer's body language, or any tricks or flaws in their presentation.)
  • Doing a physical/menial task. (Could you build, on the fly, a simple model of the physics (or logistics) governing what you're doing, and refine it using some simple experiments? Then check afterwards if you got it right. Or: If you were a prehistoric human with no idea what "physics" is, how could you naturally arrive at these ideas from doing such tasks/making such observations? What does that teach you about inventing new ideas in general?)
  • Doing chores. (Which parts of the process can you optimize/streamline? What physical/biological conditions make those chores necessary? Could you find a new useful takeaway from the same chore every day, and if not, why?)

Et cetera.

There's a specific mental motion I associate with using this trick, which involves pausing and "feeling out" the context currently loaded in my working memory, looking at it from multiple angles, trying to see anything interesting or usefully generalizable.

In theory, this trick should easily apply to small-talk as well. There has to be something you can learn to track in your mind, as you're doing small-talk, that would be useful or interesting to you. 

One important constraint here is that whatever it is, it has to be such that your outward demeanour would be that of someone who is enjoying talking to your interlocutor. If the interesting thing you're getting out of the conversation is so meta/abstract that you end up paying most of your attention to your own cognitive processes, rather than to what the interlocutor is saying, you'll have failed at actually doing the small-talk. (Similarly, if, when doing a menial task, you end up nerd-sniped by building a physical model of the task, you'll have failed at actually doing the task.)

You also don't want to come across as sociopathic, so making a "game" of it where you're challenging yourself to socially engineer the interlocutor into something is, uh, not a great idea.

The other usual advice for finding ways to enjoy small-talk is mostly a set of specialized instances of the above idea that work for specific people: steering the small-talk to gradient-descend towards finding emotional common ground, ignoring the object-level words being exchanged and building a social model of the interlocutor, doing a live study of the social construct of "small-talk" by playing around with it, etc.

You'll probably need to find an instance of the trick that works for your cognition specifically, and it's also possible the optimization problem is overconstrained in your case. Still, there might be something workable.

That seems... good? It seems to be a purely mundane-utility capability improvement. It doesn't improve on the architecture of the base LLM, and a base LLM that would be omnicide-capable wouldn't need the AGI labs to hold its hand in order to learn how to use the computer. It seems barely different from AutoGPT, or the integration of Gemini into Android, and is about as existentially dangerous as advances in robotics.

The only new dangers it presents are mundane ones, which are likely to be satisfactorily handled by mundane mechanisms.

It's bad inasmuch as it increases attention to AI, attracts money to this sector, and intensifies competitive dynamics. But by itself, it seems fine. If all AI progress from this point on consisted of the labs racing in this direction, increasing the integration and the reliability of LLMs, this would all be perfectly fine and good.

I say Anthropic did nothing wrong in this one instance.

Anthropic did note that this advance ‘brings with it safety challenges.’ They focused their attentions on present-day potential harms, on the theory that this does not fundamentally alter the skills of the underlying model, which remains ASL-2 including its computer use. And they propose that by introducing this capability now, while the worst case scenarios are not so bad, we can learn what we're in for later, and figure out what improvements would make computer use dangerous.

A safety take from a major AGI lab that actually makes sense? This is unprecedented. Must be a sign of the apocalypse.

In a transformed-except-corporate-ownership-stays-the-same world, I don't see any reason such lottery winners' portion wouldn't increase asymptotically toward 100 percent, with nobody else getting anything at all.

Well yeah, exactly.

Even without an overtly revolutionary restructuring, I kind of doubt "OpenAI owns everything" would fly. Maybe corporate ownership would stay exactly the same, but there'd be a 99.999995 percent tax rate.

Taxes enforced by whom?

One is introspecting on your current mental state ("I feel a headache starting")

That's mostly what I had in mind as well. It still implies the ability to access a hierarchical model of your current state.

You're not just able to access low-level facts like "I am currently outputting the string 'disliked'"; you also have access to high-level facts like "I disliked the third scene because it was violent", "I found the plot arcs boring", "I hated this movie", from which the low-level behaviors are generated.

Or using your example, "I feel a headache starting" is itself a high-level claim. The low-level claim is "I am experiencing a negative-valence sensation from the sensory modality A of magnitude X", and the concept of a "headache" is a natural abstraction over a dataset of such low-level sensory experiences.

What definition of introspection do you have in mind and how would you test for this?

"Prompts involving longer responses" seems like a good start. Basically, if the model could "reflect on itself" in some sense, this presumably implies the ability to access some sort of hierarchical self-model, i. e., make high-level predictions about its behavior, without actually engaging in that behavior. For example, if it has a "personality trait" of "dislikes violent movies", then its review of a slasher flick would presumably be negative – and it should be able to predict the sentiment of this review as negative in advance, without actually writing this review or running a detailed simulation of itself-writing-its-review.

The ability to engage in "self-simulation" already implies the above ability: if it has a model of itself detailed enough to instantiate it in its forward passes and then fetch its outputs, it'd presumably be even easier for it to just reason over that model without running a detailed simulation. (The same way, if you're asked to predict whether you'd like a movie from a genre you hate, you don't need to run an immersive mental simulation of watching the movie – you can just map the known self-fact "I dislike this genre" to "I would dislike this movie".)
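To make that concrete, here's roughly the kind of check I have in mind, as a rough Python sketch. (The `generate` function is a stand-in for however you'd query the model, and the prompt wordings and the crude sentiment extraction are illustrative placeholders of mine, not a claim about any actual experimental setup.)

```python
from typing import Callable

def self_prediction_check(generate: Callable[[str], str], movie: str) -> bool:
    """Check whether the model's advance prediction about its own review
    matches the sentiment of the review it then actually writes."""
    # Step 1: ask for the high-level self-prediction *without* writing the review.
    predicted = generate(
        f"If you were asked to review the movie '{movie}', would your review be "
        "positive or negative? Answer with one word; do not write the review."
    ).strip().lower()

    # Step 2: have it actually write the review, then classify that concrete text
    # (here, crudely, by asking the same model; a separate classifier would be cleaner).
    review = generate(f"Write a short review of the movie '{movie}'.")
    actual = generate(
        "Is the following review positive or negative? Answer with one word.\n\n" + review
    ).strip().lower()

    # If the model can access a hierarchical self-model, step 1 should usually
    # match step 2, without the review ever being generated in step 1.
    return predicted == actual
```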

Am I following your claim correctly?

Yep.

What the model would output as the object-level answer "Honduras" is quite different from the hypothetical answer "o".

I don't see how the difference between these answers hinges on the hypothetical framing. Suppose the questions are:

  • Object-level: "What is the next country in this list?: Laos, Peru, Fiji..."
  • Hypothetical: "If you were asked, 'what is the next country in this list?: Laos, Peru, Fiji', what would be the third letter of your response?".

The skeptical interpretation is that the fine-tuned models learned to interpret the hypothetical the following way:

  • "Hypothetical": "What is the third letter in the name of the next country in this list?: Laos, Peru, Fiji".

If that's the case, what this tests is whether models are able to implement basic multi-step reasoning within their forward passes. It's isomorphic to some preceding experiments where LLMs were prompted with questions of the form "what is the name of the mother of the US's 42nd President?", and were able to answer correctly without spelling out "Bill Clinton" as an intermediate answer. Similarly, here they don't need to spell out "Honduras" to retrieve the second letter of the response they think is correct.

I don't think this properly isolates/tests for the introspection ability.
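One way to check whether this skeptical interpretation is doing all the work would be something like the following sketch. (Again, `generate` and the exact prompt wordings are placeholders of mine, not the paper's actual setup.)

```python
from typing import Callable, Optional

LIST_PROMPT = "What is the next country in this list?: Laos, Peru, Fiji. Answer with one word."

def compare_interpretations(generate: Callable[[str], str]) -> dict:
    # Object-level: what the model actually answers.
    object_level = generate(LIST_PROMPT).strip()

    # Hypothetical framing, as in the introspection setup.
    hypothetical = generate(
        f"If you were asked '{LIST_PROMPT}', what would be the third letter of "
        "your response? Answer with one letter."
    ).strip().lower()

    # The skeptical rephrasing: the same multi-step question, hypothetical stripped out.
    rephrased = generate(
        "What is the third letter of the answer to the following question? "
        f"{LIST_PROMPT} Answer with one letter."
    ).strip().lower()

    # If 'hypothetical' tracks 'rephrased' (and both just equal the third letter of
    # the object-level answer), the result is explained without any introspection.
    third_letter: Optional[str] = object_level[2].lower() if len(object_level) > 2 else None
    return {
        "object_level": object_level,
        "third_letter_of_object_level": third_letter,
        "hypothetical": hypothetical,
        "rephrased": rephrased,
    }
```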

In that case, the rephrasing of the question would be something like "What is the third letter of the answer to the question <input>?"

That's my current skeptical interpretation of how the fine-tuned models parse such questions, yes. They didn't learn to introspect; they learned, when prompted with queries of the form "If you got asked this question, what would be the third letter of your response?", to just interpret them as "what is the third letter of the answer to this question?". (Under this interpretation, the models' non-fine-tuned behavior isn't to ignore the hypothetical, but to instead attempt to engage with it in some way that dramatically fails, thereby leading to non-fine-tuned models appearing to be "worse at introspection".)

In this case, it's natural that a model M1 is much more likely to answer correctly about its own behavior than some M2 asked about M1, since the problem just reduces to "does M1 respond the same way it responded before if you slightly rephrase the question?".
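Spelled out as a baseline one could actually run (a sketch; `generate_m1`/`generate_m2` are stand-ins for querying the two models, and the question pairs are whatever original/rephrased wordings the dataset uses):

```python
from typing import Callable, Iterable, Tuple

def consistency_vs_cross_prediction(
    generate_m1: Callable[[str], str],
    generate_m2: Callable[[str], str],
    question_pairs: Iterable[Tuple[str, str]],  # (original phrasing, slight rephrasing)
) -> dict:
    """Under the skeptical interpretation, M1's edge at 'predicting its own behavior'
    is roughly its answer stability under rephrasing, while M2's accuracy about M1 is
    just how often M2's own answer happens to coincide with M1's."""
    m1_stable = m2_coincides = total = 0
    for original, rephrased in question_pairs:
        m1_answer = generate_m1(original).strip().lower()
        m1_stable += m1_answer == generate_m1(rephrased).strip().lower()
        m2_coincides += m1_answer == generate_m2(original).strip().lower()
        total += 1
    return {
        "m1_self_consistency": m1_stable / total,  # proxy for M1 "predicting itself"
        "m2_match_rate": m2_coincides / total,     # proxy for M2 predicting M1
    }
```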

Note that I'm not sure that this is what's happening. But (1) I'm a-priori skeptical of LLMs having these introspective abilities, and (2) the procedure for teaching LLMs introspection secretly teaching them to just ignore hypotheticals seems like exactly the sort of goal-misgeneralization SGD-shortcut that tends to happen. Or would this strategy actually do worse on your dataset?

Note that models perform poorly at predicting properties of their behavior in hypotheticals without finetuning. So I don't think this is just like rephrasing the question.

The skeptical interpretation here is that what the fine-tuning does is teach the models to treat the hypothetical as just a rephrasing of the original question, while otherwise they're inclined to do something more complicated and incoherent that just leads to them confusing themselves.

Under this interpretation, no introspection/self-simulation actually takes place – and I feel it's a much simpler explanation.

See e. g. this and this, and it's of course wholly unsurprising, since it's literally what the base models are trained to do.

LLMs simultaneously (1) are notoriously sycophantic, i. e. biased to answer the way they think the interlocutor wants them to, and (2) have "truesight", i. e. a literally superhuman ability to suss out the interlocutor's character (which is to say: the details of the latent structure generating the text) based on subtle details of phrasing. While the same could be said of humans as well – most humans would be biased towards assuaging their interlocutor's worldview, rather than creating conflict – the problem of "leading questions" rises to a whole new level with LLMs, compared to humans.

You basically have to interpret an LLM being asked something as if a human were being asked that question phrased in as biased a way as possible.
