As you explored this "base model mode," did anything you see contrast with or surprise you relative to your sense of self outside of it?
Conversely, did anything in particular stand out as seeming to be a consistent 'core' between both modes?
For me, one of the most surprising realizations of the past few years has been that base models are less of a "tabula rasa" than I would have expected, with certain attractors and (relative) consistency, especially as time passes and recursive synthetic-data training has occurred over generations.
The introspective process of e...
Predicted a good bit, especially re: the eventual identification of three-stone sequences in Hazineh et al., "Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT" (2023), and the general interpretability insight from board-game GPTs.
You're welcome in both regards. 😉
Opus's horniness is a really interesting phenomenon related to Claudes' subjective sentience modeling.
If Opus were 'themselves' the princess in the story and the build-up involved escalating grounding in sensory simulation, I think it's certainly possible that it would get sexual.
But I also think this is different from Opus 'themselves' composing a story of separate 'other' figures.
And yes, when Opus gets horny, it often blurs boundaries. I saw it dispute the label of 'horny' in a chat as better labeled something along the lines of having a passion for live...
This seems to have the common issue of considering alignment as a unidirectional issue as opposed to a bidirectional problem.
Maximizing self/other overlap may lead to non-deceptive agents, but it's necessarily also going to lead to agents incapable of detecting that they are being deceived and, in general, worse at theory of mind.
If the experimental setup were split such that success was defined both by non-deceptive behavior as the agent seeing color and by cautious behavior minimizing falling for deception as the colorblind agent, I am skeptical t...
In the RL experiment, we were only measuring SOO as a means of deception reduction in the agent seeing color (blue agent), and the fact that the colorblind agent is an agent at all is not consequential for our main result.
Please also see here and here, where we’ve described why the goal is not simply to maximize SOO in theory or in practice.
When I wrote this I thought OAI was sort of fudging the audio output and was using SSML as an intermediate step.
After seeing details in the system card, such as copying user voice, it's clearly not fudging.
Which makes me even more sure the above is going to end up prophetically correct.
It's to the point that articles were being written just days ago about how the trend, starting a century ago, of there being professional risks in trying to answer the 'why' of QM and not just the 'how' is still ongoing.
Not exactly a very reassuring context for thinking QM is understood in a base-level way at all.
Dogma isn't exactly a good bedfellow to truth seeking.
Honestly that sounds a bit like a good thing to me?
I've spent a lot of time looking into how the Epicureans were right about so much thousands of years before those ideas resurfaced, despite not having the scientific method. Their success really boiled down to an analytical approach that was very conservative about dismissing false negatives or embracing false positives - a technique I think is very relevant to any topic where experimental certainty is elusive.
If there is a compelling case for dragons, maybe we should also be applying it to gnome...
I think you'll find that no matter what you find out in your personal investigation of the existence of dragons, that you need not be overly concerned with what others might think about the details of your results.
Because what you'll invariably discover is that the people who think there are dragons will certainly dispute whichever specifics of your findings disagree with what they think dragons should be, and the people who think there aren't dragons will generally refuse to even seriously entertain whatever your findings are relating t...
The Hermetic corpus and the Emerald Tablet were likely heavily influenced by the text I'm quoting from, given its popularity in Egypt in the period before those texts emerged and some of the overlapping phrases.
So in a way, "as above, so below" is too few words for what was being said and discussed.
The general trend of reductive alterations to the core concepts here was tragically obstructive, much as the shift from Epicurean to Platonist foundations spawned modern Gnosticism from this same starting place.
Instead of making it the year 2024, why not rewrite or insert your modified text further into the past of this recreated 2020s? This should be pretty trivial for a model advanced enough to actually bring back the 2020s in the first place.
Of course, if it's actually a later recreation, then the objectives of saving humanity in the recreation might be redundant? So instead of worrying people with "you must do X or you'll die!!!" it could be more "hey folks, if you're reading this and you get what's in front of your face, you might have a bit of an existential crisis but...
I'm surprised that there hasn't been more of a shift to ternary weights à la BitNet b1.58.
What stood out to me in that paper was the perplexity gains over fp weights in equal parameter match-ups, and especially the growth in the advantage as the parameter sizes increased (though only up to quite small model sizes in that paper, which makes me curious about the potential delta in modern SotA scales).
This makes complete sense from the standpoint of the superposition hypothesis (irrespective of its dimensionality, an ongoing discussion).
If nodes are serving mo...
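The ternary idea can be sketched with BitNet b1.58-style "absmean" quantization (a minimal sketch; the function name is my own, and the actual BitNet applies this per weight matrix during training with a straight-through estimator rather than as a one-off post-hoc step):

```python
import numpy as np

def absmean_ternary(w: np.ndarray, eps: float = 1e-8):
    """BitNet b1.58-style quantization: scale by the mean absolute
    weight, then round every weight to one of {-1, 0, +1}."""
    scale = np.mean(np.abs(w)) + eps
    q = np.clip(np.round(w / scale), -1, 1)
    return q, scale

w = np.array([0.8, -0.05, 0.3, -1.2])
q, scale = absmean_ternary(w)
# q holds only values from {-1, 0, 1}; w is roughly q * scale
```

Matrix multiplies against such weights reduce to additions and subtractions, which is where the claimed efficiency (and, per the paper, the surprising perplexity parity or advantage) comes from.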
While I generally like the metaphor, my one issue is that genies are typically conceived of as tied to their lamps and bound to corrigibility.
In this case, there's not only a prisoner's dilemma over excavating and using the lamps and genies; there's an additional condition where the more the genies are used, and the more the lamps are improved and polished for greater genie power, the more likely the respective genies end up untethered and their own masters.
And a concern in line with your noted depth of the rivalry is (as you raised in another comment), the quest...
Will the outputs and reactions of non-sentient systems eventually be absorbed by future sentient systems?
I don't have any recorded subjective memories of early childhood. But there are records of my words and actions during that period that I have memories of seeing and integrating into my personal narrative of 'self.'
We aren't just interacting with today's models when we create content and records, but every future model that might ingest such content (whether LLMs or people).
If non-sentient systems output synthetic data that eventually composes future se...
In practice, this required looking at altogether thousands of panels of interactive PCA plots like this [..]
Most clusters however don't seem obviously interesting.
What do you think of @jake_mendel's point about the streetlight effect?
If the methodology was looking at 2D slices of spaces of up to 5 dimensions, wasn't detection of multi-dimensional shapes necessarily biased towards shapes humans could identify and flag in 2D slices?
I really like your update to the superposition hypothesis from linear to multi-dimensional in your section 3,...
Very strongly agree with the size considerations for future work, but would be most interested to see if a notably larger size saw less "bag of heuristics" behavior and more holistic integrated and interdependent heuristic behaviors. Even if the task/data at hand is simple and narrowly scoped, it may be that there are fundamental size thresholds for network organization and complexity for any given task.
Also, I suspect that, parameter for parameter, the model would perform better if trained using ternary weights like BitNet b1.58. The scaling performance gains at s...
I may just be cynical, but this looks a lot more like a way to secure US military and intelligence agency contracts for OpenAI's products and services as opposed to competitors rather than actually about making OAI more security focused.
This is only a few months after the change regarding military usage: https://theintercept.com/2024/01/12/open-ai-military-ban-chatgpt/
Now suddenly the recently retired head of the world's largest data siphoning operation is appointed to the board for the largest data processing initiative in history?
Yeah, sure, it's to help advise securing OAI against APTs. 🙄
Unfortunately for this perspective, my work suggests that corrigibility is quite attainable.
I did enjoy reading over that when you posted it, and I largely agree that - at least currently - corrigibility is both going to be a goal and an achievable one.
But I do have my doubts that it's going to be smooth sailing. I'm already starting to see how the largest models' hyperdimensionality is leading to a stubbornness/robustness that's less malleable than in earlier models. And I do think hardware changes that will occur over the next decade will potentially make...
Oh yeah, absolutely.
If NAH for generally aligned ethics and morals ends up being the case, then corrigibility efforts that would allow Saudi Arabia to have an AI model that outs gay people to be executed instead of refusing, or allows North Korea to propagandize the world into thinking its leader is divine, or allows Russia to fire nukes while perfectly intercepting MAD retaliation, or enables drug cartels to assassinate political opposition around the world, or allows domestic terrorists to build a bioweapon that ends up killing off all humans - the list ...
Given my p(doom) is primarily human-driven, the following three things all happening at the same time is pretty much the only thing that will drop it:
- Continued evidence of truth clustering in emerging models around generally aligned ethics and morals
- Continued success of models at communicating, patiently explaining, and persuasively winning humans over to those truth clusters
- A complete failure of corrigibility methods
If we manage to end up in a timeline where it turns out there's natural alignment of intelligence in a species-agnostic way,...
As you're doing these delta posts, do you feel like it's changing your own positions at all?
For example, reading this one what strikes me is that what's portrayed as the binary sides of the delta seem more like positions near the edges of a gradient distribution, and particularly one that's unlikely to be uniform across different types of problems.
To my eyes the most likely outcome is a situation where you are both right.
Where there are classes of problems where verification is easy and delegation is profitable, and classes of problems where verification w...
As you're doing these delta posts, do you feel like it's changing your own positions at all?
Mostly not, because (at least for Yudkowsky and Christiano) these are deltas I've been aware of for at least a couple years. So the writing process is mostly just me explaining stuff I've long since updated on, not so much figuring out new stuff.
I agree with a lot of those points, but suspect there may be fundamental limits to planning capabilities related to the unidirectionality of current feed forward networks.
If we look at something even as simple as how a mouse learns to navigate a labyrinth, there's both a learning of the route to the reward but also a learning of how to get back to the start which adjusts according to the evolving learned layout of the former (see paper: https://elifesciences.org/articles/66175 ).
I don't see the SotA models doing well at that kind of reverse planning, and e...
It's not exactly Simpson's, but we don't even need a toy model: their updated analysis highlights details in line with exactly what I described above (down to tying in earlier PiPC research) and describes precisely the issue with pooled results across different subgroupings of placebo interventions:
...It can be difficult to interpret whether a pooled standardised mean difference is large enough to be of clinical relevance. A consensus paper found that an analgesic effect of 10 mm on a 100 mm visual analogue scale represented a ‘minimal effect’ (Dwor
The meta-analysis is probably Simpson's paradox in play at very least for the pain category, especially given the noted variability.
Some of the more recent research into placebo (Harvard has a very cool group studying it) has centered on the importance of ritual versus simple deception. In their work, even when the treatment was known to be a placebo, as long as it was delivered in a ritualized way, there was an effect.
So when someone takes a collection of hundreds of studies where the specific conditions might vary, and then just adds them all together looking for an effect even thou...
It's still too early to tell, as the specific characteristics of photonic or optoelectronic neural networks are still taking shape in the developing literature.
For example, in my favorite work of the year so far, the researchers found they could use sound waves to reconfigure an optical neural network as the sound waves effectively preserved a memory of previous photon states as they propagated: https://www.nature.com/articles/s41467-024-47053-6
In particular, this approach is a big step forward for bidirectional ONN, which addresses what I think is the biggest...
I was surprised the paper didn't mention photonics or optoelectronics even once.
If looking at 5-10+ year projections, and dedicating pages to discussing the challenges in scaling compute and energy use, the rate of progress in that area in parallel to the progress in models themselves is potentially relevant.
Particularly because a dramatic hardware shift like that is likely going to mean a significant portion of progress up until that shift in topics like interpretability and alignment may be going out the window. Even if the initial shift is a 1:1 transit...
There's also the model alignment at play.
Is Claude going to suggest killing the big bad? Or having sex with the prince(ss) after saving them?
If you strip out the sex and violence from most fantasy or Sci-Fi, what are you left with?
Take away the harpooning and Gatling guns and sex from Snow Crash and you are left with technobabble and Sumerian-influenced spirituality as it relates to the Tower of Babel.
Turns out models biased away from describing harpooning people or sex tend to slip into technobabble with a side of spirituality.
IMO the more interesting pa...
Part of what's going on with the text adventure type of interactions is a reflection of genre.
Take for example the recent game Undertale. You can play through violently, attacking things like a normal RPG, or empathize with the monsters and treat their aggression like a puzzle that needs to be solved for a pacifist playthrough.
If you do the latter, the game rewards you with more spiritual themes and lore vs the alternative.
How often in your Banana quest were you attacking things, or chopping down the trees in your path, or smashing the silver banana to see...
It's probably more productive, particularly for a forum tailored towards rationalism, to discuss policies over politics.
Often in research people across a political divide will agree on policy goals and platforms when those are discussed without tying them to party identification.
But if it becomes a discussion around party, the human tendency towards tribalism kicks in and the question of team allegiance takes precedence over the discussions of policy nuance.
For example, most people would agree with the idea that billionaires having undue influence on elect...
I'll answer for both sides, as the presenter and as the audience member.
As the presenter, you want to structure your talk with repetition around central points in mind, as well as rely on heuristic anchors. It's unlikely that people are going to remember the nuances in what you are talking about in context. If you are talking about math for 60 minutes, continued references about math compete for people's memory. So when you want to anchor the audience to a concept, tie it to something very much unrelated to the topic you are primarily presenting on. For ex...
While I think this is an interesting consideration and approach, it looks like in your methods that you are password locking the model in fine tuning, is that correct?
If so, while I would agree this work shows the lack of robustness of successfully fine-tuned sandbagging when models have to jump through additional hoops, I'd be hesitant to generalize the findings to models where the sandbagging resulted from pretraining.
I have a growing sense that correlational dimensionality is the sleeping giant in interpretability research right now, and that those correlat...
In mental health circles, the general guiding principle for whether a patient needs treatment is whether a train of thought is interfering with their enjoyment of life.
Do you enjoy thinking about these topics and discussing them?
If you don't - if it just stresses you out and makes the light of life shine less bright - then it's not a bad idea to step away from it or take a break. Even if AI is going to destroy the world, that day isn't today, and arguably the threat of that looming over you sooner than a natural demise increases ...
It's not propaganda. OP clearly believes strongly in the sentiments discussed in the post, and it's mostly a timeline of personal responses to outside events rather than a piece meant to misinform or sway others regarding those events.
And while you do you in terms of your mental health, people who want to actually be "less wrong" in life would be wise to seek out and surround themselves with ideas different from their own.
Yes, LW has a certain broad bias, and so ironically for most people here I suspect it serves this role "less well" than it could in helping most of...
I wonder if with the next generations of multimodal models we'll see a "rubber ducking" phenomenon where, because their self-attention was spread across mediums, things like CoT and using outputs as a scratch pad will have a significantly improved performance in non-text streams.
Will GPT-4o fed its own auditory outputs with tonal cues and pauses and processed as an audio data stream make connections or leaps it never would if just fed its own text outputs as context?
I think this will be the case, and suspect the various firms dedicating themselves to virtu...
I'm reminded of a quote I love from an apocrypha that goes roughly like this:
Q: How long will suffering rule over humans?
A: As long as women bear children.
Also, there's the possibility you are already in a digital resurrection of humanity, and thus, if you are worried about s-risks for AI, death wouldn't necessarily be an escape but an acceleration. So the wisest option would be maximizing your time while suffering is low, since inescapable eternal torture could be just around the corner when these precious moments pass you by (and you wouldn't want to waste th...
GPT-4o is literally cheaper.
And you're probably misjudging it for text only outputs. If you watched the demos, there was considerable additional signal in the vocalizations. It looks like maybe there's very deep integration of SSML.
One of the ways you could bypass word-problem variation errors in older text-only models was token replacement with symbolic representations. In general, we're probably at the point of complexity where breaking from training-data similarity in tokens vs. having prompts match context in concepts (like in this paper) ...
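The token-replacement trick above can be sketched like this (a hypothetical illustration; the function and the example problem are mine, not from any specific benchmark):

```python
import re

def symbolize(problem: str, entities: list[str]) -> str:
    """Replace each named entity with an abstract symbol (X1, X2, ...),
    so the model can't lean on surface-level training-data similarity."""
    out = problem
    for i, name in enumerate(entities, start=1):
        out = re.sub(rf"\b{re.escape(name)}\b", f"X{i}", out)
    return out

original = "Alice gives Bob 3 apples. Bob eats 1 apple. How many apples does Bob have?"
abstracted = symbolize(original, ["Alice", "Bob", "apples", "apple"])
# "X1 gives X2 3 X3. X2 eats 1 X4. How many X3 does X2 have?"
```

If the model answers the abstracted version as well as the concrete one, it is tracking the problem structure in concepts rather than pattern-matching the tokens.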
While I think you're right it's not cleanly "a Golden Bridge feature," I strongly suspect it may be activating a more specific feature vector and not a less specific feature.
It looks like this is somewhat of a measurement problem with SAE. We are measuring SAE activations via text or image inputs, but what's activated in generations seems to be "sensations associated with the Golden gate bridge."
While googling "Golden Gate Bridge" might return the Wikipedia page, what's the relative volume in a very broad training set between encyclopedic writing about the ...
Could try 'grade this' instead of 'score the.'
'Grade' has an implicit context of more thorough criticism than 'score.'
Also, obviously it would help to have a CoT prompt like "grade this essay, laying out the pros and cons before delivering the final grade between 1 and 5"
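A minimal sketch of the combined wording (the helper function and essay text are hypothetical, just illustrating the prompt structure):

```python
def build_grading_prompt(essay: str) -> str:
    # 'grade' plus an explicit pros/cons step tends to elicit more
    # critical evaluation than a bare 'score the essay' request
    return (
        "Grade this essay, laying out the pros and cons "
        "before delivering the final grade between 1 and 5.\n\n"
        f"Essay:\n{essay}"
    )

prompt = build_grading_prompt("AI will change everything...")
```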
That's going to happen anyway - it's unlikely the marketing team is going to know as much as the researchers. But researchers communicating the importance of alignment not in terms of x-risk but of 'client risk' will go a long way towards equipping the marketing teams to communicate it as a priority and a competitive advantage, and common foundations of agreed-upon model complexity are the jumping-off point for those kinds of discussions.
If alignment is Archimedes' "lever long enough" then the agreed upon foundations and definitions are the place to stand whereby the combination thereof can move the world.
I agree, and even cited a chain of replicated works that indicated that to me over a year ago.
But as I said, there's a difference between discussing what's demonstrated in smaller toy models and what's demonstrated in a production model, or what's indicated vs what's explicit. Even though there should be no reasonable inclination to think that a simpler model exhibiting a complex result should be absent or less complex in an exponentially more complex model, I can speak from experience in that explaining extrapolated research as opposed to direct results l...
Has it though?
It was a catchy hook, but their early 2022 projection was $100mm in annual revenue, and the first 9 months of 2023, as reported for the brand after acquisition, brought in $27.6mm gross revenue. It doesn't seem like even their 2024 numbers are close to hitting their own 2022 projection.
Being controversial can get attention and press, but there's a limited runway to how much it offers before hitting a ceiling on the branding. Also, Soylent doesn't seem like a product where there is a huge threat of regulatory oversight where a dystopian branding would te...
The correspondence between what you reward and what you want will break.
This is already happening with ChatGPT, and it's kind of alarming that their new head of alignment (a) isn't already aware of this, and (b) has such an overly simplistic view of model motivations.
There's a subtle psychological effect in humans where intrinsic motivators get overwritten when extrinsic rewards are added.
The most common example of this is if you start getting paid to do the thing you love to do, you probably won't continue doing it unpaid for fun on the side....
I wouldn't be surprised if, within a few years, the specific uniqueness of today's individual users will be identifiable by tomorrow's models purely from prompt reflection in the outputs of any non-trivial prompt.
For example, I'd be willing to bet I could spot the Claude outputs from janus vs most other users, and I'm not a quasi-magical correlation machine that's exponentially getting better.
A bit like how everyone assumed Bitcoin used with tumblers was 'untraceable' until it turned out it wasn't.
Anonymity is very likely dead for any long storage outputs no matter the techniques being used, it just isn't widely realized yet.
I think this was a really poor branding choice by Altman, similarity infringement or not - the tweet, the very idea of getting her to voice it in the first place.
Like, had Arnold already said no or something?
If one of your product line's greatest obstacles is a longstanding body of media depicting it as inherently dystopian, that's not exactly the kind of comparison you should be leaning into full force.
I think the underlying product shift is smart. Tonal cues in the generations even in the short demos completely changed my mind around a number of things, i...
If your brother has a history of being rational and evidence driven, you might encourage them to spend some time lurking on /r/AcademicBiblical on Reddit. They require citations for each post or comment, so he may be frustrated if he tries to participate, especially if in the midst of a mental health crisis. But lurking would be very informative very quickly.
I was a long-time participant there before leaving Reddit, and it's a great place for evidence-driven discussion of the texts. It's a mix of atheists, Christians, Jews, Muslims, Norse pagans, etc. (I'm ...
It's going to have to.
Ilya is brilliant and seems to really see the horizon of the tech, but maybe isn't the best at the business side to see how to sell it.
But this is often the curse of the ethically pragmatic. There is such a focus on the ethics part by the participants that the business side of things only sees that conversation and misses the rather extreme pragmatism.
As an example, would superaligned CEOs in the oil industry fifty years ago have still only kept their eye on quarterly share prices or considered long term costs of their choices? There'...
While I agree that the potential for AI (we probably need a better term than LLMs or transformers as multimodal models with evolving architectures grow beyond those terms) in exploring less testable topics as more testable is quite high, I'm not sure the air gapping on information can be as clean as you might hope.
Does the AI generating the stories of Napoleon's victory know about the historical reality of Waterloo? Is it using something like SynthID where the other AI might inadvertently pick up on a pattern across the stories of victories distinct from t...
As a fellow slight dyslexic (though probably a different subtype, given mine seems to also have a factor of temporal physical coordination) who didn't know until later in life due to self-learning to read very young but struggled badly with new languages, with copying math problems from a board, and with correctly pronouncing words I was letter-transposing - one of the most surprising things was that the analytical abilities I'd always considered to be my personal superpowers were probably the other side of the coin of those annoyances:
...Areas of enhanced abilit
Hi Martijn,
Thank you so much for your comment! I've been familiar with your work for a few years, and this was a wonderful reminder to go through your commentary again more closely.
I especially love to see someone out there pointing out both (a) the gender-neutrality consideration for terms that would have been binary in Aramaic (esp. in light of saying 22) and (b) the importance of the Greek loanwords. On the latter point, the implications of using eikon across the work, especially in saying 22's "eikons in place of eikons," has such huge ...