This is a special post for quick takes by jacquesthibs. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
jacquesthibs's Shortform

I would find it valuable if someone could gather an easy-to-read bullet point list of all the questionable things Sam Altman has done throughout the years.

I usually link to Gwern’s comment thread (https://www.lesswrong.com/posts/KXHMCH7wCxrvKsJyn/openai-facts-from-a-weekend?commentId=toNjz7gy4rrCFd99A), but I would prefer if there was something more easily-consumable.

[Edit #2, two months later: see https://ailabwatch.org/resources/integrity/]

[Edit: I'm not planning on doing this but I might advise you if you do, reader.]

50% I'll do this in the next two months if nobody else does. But not right now, and someone else should do it too.

Off the top of my head (this is not the list you asked for, just an outline):

  • Loopt stuff
  • YC stuff
  • YC removal
  • NDAs
    • And deceptive communication recently
    • And maybe OpenAI's general culture of don't publicly criticize OpenAI
  • Profit cap non-transparency
  • Superalignment compute
  • Two exoduses of safety people; negative stuff people-who-quit-OpenAI sometimes say
  • Telling board members not to talk to employees
  • Board crisis stuff
    • OpenAI executives telling the board Altman lies
    • The board saying Altman lies
    • Lying about why he wanted to remove Toner
    • Lying to try to remove Toner
    • Returning
    • Inadequate investigation + spinning results

Stuff not worth including:

  • Reddit stuff - unconfirmed
  • Financial conflict-of-interest stuff - murky and not super important
  • Misc instances of saying-what's-convenient (e.g. OpenAI should scale because of the prospect of compute overhang and the $7T chip investment thing) - idk, maybe, also interested in more examples
  • Johansson
... (read more)
6jacquesthibs
Here’s a new one: https://x.com/jacquesthibs/status/1796275771734155499?s=61&t=ryK3X96D_TkGJtvu2rm0uw Sam added in SEC filings (for AltC) that he’s YC’s chairman. Sam Altman has never been YC’s chairman. From an article posted on April 15th, 2024: “Annual reports filed by AltC for the past 3 years make the same claim. The recent report: Sam was currently chairman of YC at the time of filing and also "previously served" as YC's chairman.” The journalist who replied to me said: “Whether Sam Altman was fired from YC or not, he has never been YC's chair but claimed to be in SEC filings for his AltC SPAC which merged w/Oklo. AltC scrubbed references to Sam being YC chair from its website in the weeks since I first reported this.” The article: https://archive.is/Vl3VR
2jacquesthibs
Just a heads up, it's been 2 months!
2Zach Stein-Perlman
Not what you asked for but related: https://ailabwatch.org/resources/integrity/
1lemonhope
His sister's accusations that he blocked her from parent's inheritance and that he molested her when he was a young teenager and that he got her social media accounts flagged as spam to hide the accusations
6gwern
I would not consider her claims worth including in a list of top items for people looking for an overview, as they are hard to verify or dubious (her comments are generally bad enough to earn flagging on their own), aside from possibly the inheritance one - as that should be objectively verifiable, at least in theory, and lines up better with the other items.
3robo
I'm very unsure how to do this, but are there ways to collect some counteracting or unbiased samples about Sam Altman? Or to do another one-sided vetting for other CEOs to see what the base rate of being able to dig up questionable things is? Collecting evidence that points in only one direction just sets off huge warning lights 🚨🚨🚨🚨 I can't quiet.
6gwern
Yes, it should. And that's why people are currently digging so hard in the other direction, as they begin to appreciate to what extent they have previously had evidence that only pointed in one direction and badly misinterpreted things like, say, Paul Graham's tweets or YC blog post edits or ex-OAer statements.
2jacquesthibs
Given today's news about Mira (and two other execs leaving), I figured I should bump this again. But also note that @Zach Stein-Perlman has already done some work on this (as he noted in his edit): https://ailabwatch.org/resources/integrity/. Note, what is hard to pinpoint when it comes to S.A. is that many of the things he does have been described as "papercuts". This is the kind of thing that makes it hard to make a convincing case for wrongdoing.

How likely is it that the board hasn’t released specific details about Sam’s removal because of legal reasons? At this point, I feel like I have to place overwhelmingly high probability on this.

So, if this is the case, what legal reason is it?

[-]owencb166

My mainline guess is that information about bad behaviour by Sam was disclosed to them by various individuals, and they owe a duty of confidence to those individuals (where revealing the information might identify the individuals, who might thereby become subject to some form of retaliation).

("Legal reasons" also gets some of my probability mass.)

2jacquesthibs
I think this sounds reasonable, but if this is true, why wouldn’t they just say this?
4Amalthea
It might not be legal reasons specifically, but some hard-to-specify mix of legal reasons/intimidation/bullying. While it's useful to discuss specific ideas, it should be kept in mind that Altman doesn't need to restrict his actions to any specific avenue that could be neatly classified.
3jacquesthibs
My question as to why they can’t share all the examples was not answered, but Helen gives background on what happened here: https://open.spotify.com/episode/4r127XapFv7JZr0OPzRDaI?si=QdghGZRoS769bGv5eRUB0Q&context=spotify%3Ashow%3A6EBVhJvlnOLch2wg6eGtUa She does confirm she can’t give all of the examples (though she points to the ones that were reported). Which is not nothing, but eh. However, she also mentioned that it was under-reported how scared people were of Sam and that he was creating a very toxic environment.
2Dagon
"legal reasons" is pretty vague.  With billions of dollars at stake, it seems like public statements can be used against them more than it helps them, should things come down to lawsuits.  It's also the case that board members are people, and want to maintain their ability to work and have influence in future endeavors, so want to be seen as systemic cooperators.
2RobertM
But surely "saying nearly nothing" ranks among the worst-possible options for being seen as a "systemic cooperator"?
2Dagon
I should have specified WHO they want to cooperate with in the future.  People with lots of money to spend - businesses.  Silence is far preferable to badmouthing former coworkers.

I thought Superalignment was a positive bet by OpenAI, and I was happy when they committed to putting 20% of their current compute (at the time) towards it. I stopped thinking about that kind of approach because OAI already had competent people working on it. Several of them are now gone.

It seems increasingly likely that the entire effort will dissolve. If so, OAI has now made the business decision to invest its capital in keeping its moat in the AGI race rather than basic safety science. This is bad and likely another early sign of what's to come.

I think the research that was done by the Superalignment team should continue to happen outside of OpenAI and, if governments have a lot of capital to allocate, they should figure out a way to provide compute to continue those efforts. Or maybe there's a better way forward. But I think it would be pretty bad if all the talent directed towards the project never gets truly leveraged into something impactful.

3Bogdan Ionut Cirstea
Strongly agree; I've been thinking for a while that something like a public-private partnership involving at least the US government and the top US AI labs might be a better way to go about this. Unfortunately, recent events seem in line with it not being ideal to only rely on labs for AI safety research, and the potential scalability of automating it should make it even more promising for government involvement. [Strongly] oversimplified, the labs could provide a lot of the in-house expertise, the government could provide the incentives, public legitimacy (related: I think of a solution to aligning superintelligence as a public good) and significant financial resources.
1kromem
It's going to have to. Ilya is brilliant and seems to really see the horizon of the tech, but maybe isn't the best at the business side to see how to sell it. But this is often the curse of the ethically pragmatic. There is such a focus on the ethics part by the participants that the business side of things only sees that conversation and misses the rather extreme pragmatism. As an example, would superaligned CEOs in the oil industry fifty years ago have still only kept their eye on quarterly share prices or considered long term costs of their choices? There's going to be trillions in damages that the world has taken on as liabilities that could have been avoided with adequate foresight and patience. If the market ends up with two AIs, one that will burn down the house to save on this month's heating bill and one that will care if the house is still there to heat next month, there's a huge selling point for the one that doesn't burn down the house as long as "not burning down the house" can be explained as "long term net yield" or some other BS business language. If instead it's presented to executives as "save on this month's heating bill" vs "don't unhouse my cats" leadership is going to burn the neighborhood to the ground. (Source: Explained new technology to C-suite decision makers at F500s for years.) The good news is that I think the pragmatism of Ilya's vision on superalignment is going to become clear over the next iteration or two of models and that's going to be before the question of models truly being unable to be controlled crops up. I just hope that whatever he's going to be keeping busy with will allow him to still help execute on superalignment when the market finally realizes "we should do this" for pragmatic reasons and not just amorphous ethical reasons execs just kind of ignore. And in the meantime I think given the present pace that Anthropic is going to continue to lay a lot of the groundwork on what's needed for alignment on the way to s

For anyone interested in Natural Abstractions type research: https://arxiv.org/abs/2405.07987

Claude summary:

Key points of "The Platonic Representation Hypothesis" paper:

  1. Neural networks trained on different objectives, architectures, and modalities are converging to similar representations of the world as they scale up in size and capabilities.

  2. This convergence is driven by the shared structure of the underlying reality generating the data, which acts as an attractor for the learned representations.

  3. Scaling up model size, data quantity, and task diversity leads to representations that capture more information about the underlying reality, increasing convergence.

  4. Contrastive learning objectives in particular lead to representations that capture the pointwise mutual information (PMI) of the joint distribution over observed events.

  5. This convergence has implications for enhanced generalization, sample efficiency, and knowledge transfer as models scale, as well as reduced bias and hallucination.
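Point 4 is easier to reason about with the definition in hand: PMI(x, y) = log p(x, y) / (p(x) p(y)). The following is a minimal sketch that only computes PMI from co-occurrence counts. It does not train a contrastive model or reproduce anything from the paper, and the toy pairs are made up for illustration.

```python
import math
from collections import Counter

def pmi_table(pairs):
    """Estimate PMI(x, y) = log p(x, y) / (p(x) p(y)) from observed (x, y) pairs."""
    joint = Counter(pairs)
    total = sum(joint.values())
    px, py = Counter(), Counter()
    for (x, y), count in joint.items():
        px[x] += count  # marginal count of x
        py[y] += count  # marginal count of y
    return {
        (x, y): math.log((count / total) / ((px[x] / total) * (py[y] / total)))
        for (x, y), count in joint.items()
    }

# Toy co-occurrence data (invented for this example).
pairs = (
    [("dog", "bark")] * 4 + [("dog", "meow")] * 1
    + [("cat", "meow")] * 4 + [("cat", "bark")] * 1
)
table = pmi_table(pairs)
for pair, value in sorted(table.items()):
    print(pair, round(value, 3))
```

Pairs that co-occur more often than their marginals predict get positive PMI, and pairs that co-occur less often get negative PMI. The summarized claim is that similarity under a contrastive objective tends towards this quantity.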

Relevance to AI alignment:

  1. Convergent representations shaped by the structure of reality could lead to more reliable and robust AI systems that are better anchored to the real worl

... (read more)
4Gunnar_Zarncke
I recommend making this into a full link-post. I agree about the relevance for AI alignment. 
4cubefox
This sounds really intriguing. I would like someone who is familiar with natural abstraction research to comment on this paper.
1Lorxus
I am very very vaguely in the Natural Abstractions area of alignment approaches. I'll give this paper a closer read tomorrow (because I promised myself I wouldn't try to get work done today) but my quick quick take is - it'd be huge if true, but there's not much more than that there yet, and it also has no argument that even if representations are converging for now, that it'll never be true that (say) adding a whole bunch more effectively-usable compute means that the AI no longer has to chunk objectspace into subtypes rather than understanding every individual object directly.

I thought this series of comments from a former DeepMind employee (who worked on Gemini) were insightful so I figured I should share.

From my experience doing early RLHF work for Gemini, larger models exploit the reward model more. You need to constantly keep collecting more preferences and retraining reward models to make it not exploitable. Otherwise you get nonsensical responses which have exploited the idiosyncrasies of your preference data. There is a reason few labs have done RLHF successfully.

It's also known that more capable models exploit loopholes in reward functions better. Imo, it's a pretty intuitive idea that more capable RL agents will find larger rewards. But there's evidence from papers like this as well: https://arxiv.org/abs/2201.03544

To be clear, I don't think the current paradigm as-is is dangerous. I'm stating the obvious because this platform has gone a bit bonkers.

The danger comes from finetuning LLMs to become AutoGPTs which have memory, actions, and maximize rewards, and are deployed autonomously. Widespread proliferation of GPT-4+ models will almost certainly make lots of these agents which will cause a lot of damage and potentially cause something ind

... (read more)
7leogao
"larger models exploit the RM more" is in contradiction with what i observed in the RM overoptimization paper. i'd be interested in more analysis of this
4Algon
In that paper did you guys take a good long look at the output of various sized models throughout training? In addition to looking at the graphs of gold-standard/proxy reward model ratings against KL-divergence. If not, then maybe that's the discrepancy: perhaps Sherjil was communicating with the LLM and thinking "this is not what we wanted". 
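For readers who have not seen the gold-versus-proxy plots mentioned above, here is a toy sketch of the setup, under loud assumptions: the "gold" reward is a standard normal draw, the "proxy" reward model is gold plus independent Gaussian noise, and the policy is best-of-n selection on the proxy. The KL expression log(n) - (n - 1)/n is the one I recall being used for best-of-n in the overoptimization paper. Nothing here models model size, so it cannot settle the disagreement between the quoted comment and leogao.

```python
import math
import random

def best_of_n_kl(n):
    """KL between the best-of-n policy and the base policy: log(n) - (n - 1) / n."""
    return math.log(n) - (n - 1) / n

def simulate_best_of_n(n, trials=2000, noise_std=1.0, seed=0):
    """Return (mean proxy score, mean gold score) of the sample picked by the proxy."""
    rng = random.Random(seed)
    proxy_total, gold_total = 0.0, 0.0
    for _ in range(trials):
        best_proxy, best_gold = -math.inf, 0.0
        for _ in range(n):
            gold = rng.gauss(0.0, 1.0)                 # assumed "true" quality
            proxy = gold + rng.gauss(0.0, noise_std)   # assumed noisy reward model
            if proxy > best_proxy:
                best_proxy, best_gold = proxy, gold
        proxy_total += best_proxy
        gold_total += best_gold
    return proxy_total / trials, gold_total / trials

for n in (1, 4, 16, 64):
    proxy_mean, gold_mean = simulate_best_of_n(n)
    print(f"n={n:3d}  KL={best_of_n_kl(n):.2f}  proxy={proxy_mean:.2f}  gold={gold_mean:.2f}")
```

In this Gaussian toy the gap between proxy and gold widens as optimization pressure grows, but gold never turns downward. Reproducing the eventual decline would need a proxy whose errors are exploitable rather than independent noise.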

Why aren't you doing research on making pre-training better for alignment?

I was on a call today, and we talked about projects that involve studying how pre-trained models evolve throughout training and how we could guide the pre-training process to make models safer. For example, could training on synthetic/transformed data make models significantly more robust and essentially solve jailbreaking? How about the intersection of pretraining from human preferences and synthetic data? Could the resulting model be significantly easier to control? How would it impact the downstream RL process? Could we imagine a setting where we don't need RL (or at least we'd be able to confidently use resulting models to automate alignment research)? I think many interesting projects could fall out of this work.
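To make "pretraining from human preferences" concrete, here is a minimal sketch of one idea I recall from that line of work, conditional training: score each pre-training document and prepend a control token, so the model learns the distinction and can be conditioned on the "good" token at inference. Everything named here is an assumption for illustration: the token strings, the threshold, and especially `toy_score`, which is a hypothetical stand-in for a learned preference model.

```python
GOOD, BAD = "<|good|>", "<|bad|>"  # control token names are an assumption

def toy_score(text):
    """Hypothetical stand-in for a learned preference or reward model."""
    blocklist = {"idiot", "stupid"}
    words = (w.strip(".,!?") for w in text.lower().split())
    return -sum(w in blocklist for w in words)

def tag_for_conditional_training(docs, score_fn=toy_score, threshold=0):
    """Prepend GOOD to documents scoring at or above the threshold, BAD otherwise."""
    return [(GOOD if score_fn(doc) >= threshold else BAD) + doc for doc in docs]

corpus = [
    "Thanks for the clear explanation.",
    "Only an idiot would think that.",
]
tagged = tag_for_conditional_training(corpus)
for doc in tagged:
    print(doc)
```

The appeal of this design is that no data is thrown away: the model still sees the undesirable text, but labeled as such.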

So, back to my main question: why aren't you doing research on making pre-training better for alignment? Is it because it's too expensive and doesn't seem like a low-hanging fruit? Or do you feel it isn't a plausible direction for aligning models?

We were wondering if there are technical bottlenecks that would make this kind of research more feasible for alignment research to better study ho... (read more)

3jacquesthibs
Synthesized various resources for this "pre-training for alignment" type work:

  • Data
    • Synthetic Data
      • The RetroInstruct Guide To Synthetic Text Data
      • Alignment In The Age of Synthetic Data
      • Leveraging Agentic AI for Synthetic Data Generation
      • AutoEvol: Automatic Instruction Evolving for Large Language Models. We build a fully automated Evol-Instruct pipeline to create high-quality, highly complex instruction tuning data
      • Synthetic Data Generation and AI Feedback notebook
      • The impact of models training on their own outputs and how it's actually done well in practice
      • Google presents Best Practices and Lessons Learned on Synthetic Data for Language Models
    • Transformed/Enrichment of Data
      • Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling. TLDR: You can train 3x faster and with up to 10x less data with just synthetic rephrases of the web!
      • Better Synthetic Data by Retrieving and Transforming Existing Datasets
      • Rho-1: Not All Tokens Are What You Need. RHO-1-1B and 7B achieve SotA results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens.
    • Data Attribution
      • In-Run Data Shapley
      • Scaling Laws for the Value of Individual Data Points in Machine Learning. We show how some data points are only valuable in small training sets; others only shine in large datasets.
      • What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions
    • Data Mixtures
      • Methods for finding optimal data mixture
      • RegMix: Data Mixture as Regression for Language Model Pre-training
    • Curriculum Learning
      • On transforming data into a curriculum to improve learning efficiency and capability
      • Curriculum learning that actually works?
    • Active Data Selection
      • MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models MATE
1myyycroft
GPT-2 1.5B is small by today's standards. I hypothesize people are not sure if findings made for models of this scale will generalize to frontier models (or at least to the level of LLaMa-3.1-70B), and that's why nobody is working on it. However, I was impressed by "Pre-Training from Human Preferences". I suppose that pretraining could be improved, and it would be a massive deal for alignment.
1eggsyntax
One key question here, I think: a major historical alignment concern has been that for any given finite set of outputs, there are an unbounded number of functions that could produce it, and so it's hard to be sure that a model will generalize in a desirable way. Nora Belrose goes so far as to suggest that 'Alignment worries are quite literally a special case of worries about generalization.' This is relevant for post-training but I think even more so for pre-training. I know that there's been research into how neural networks generalize both from the AIS community and the larger ML community, but I'm not very familiar with it; hopefully someone else can provide some good references here.

If you work at a social media website or YouTube (or know anyone who does), please read the text below:

Community Notes is one of the best features to come out on social media apps in a long time. The code is even open source. Why haven't other social media websites picked it up yet? If they care about truth, this would be a considerable step forward. Messages like “this video is funded by x nation” or “this video talks about health info; go here to learn more” are simply not good enough.

If you work at companies like YouTube or know someone who does, let's figure out who we need to talk to to make it happen. Naïvely, you could spend a weekend DMing a bunch of employees (PMs, engineers) at various social media websites in order to persuade them that this is worth their time and probably the biggest impact they could have in their entire career.

If you have any connections, let me know. We can also set up a doc of messages to send in order to come up with a persuasive DM.

5jacquesthibs
Don't forget that we train language models on the internet! The more truthful your dataset is, the more truthful the models will be! Let's revamp the internet for truthfulness, and we'll subsequently improve truthfulness in our AI systems!!
2Viliam
I don't use Xitter; is there a way to display e.g. top 100 tweets with community notes? To see how it works in practice.
6Yoav Ravid
I don't know of something that does so at random, but this page automatically shares posts with community notes that have been deemed helpful.

Oh, that’s great, thanks! Also reminded me of (the less official, more comedy-based) “Community Notes Violating People”. @Viliam 

2Viliam
Thank you both! This is perfect. It's like a rational version of Twitter, and I didn't expect to use those words in the same sentence.
2jacquesthibs
I don’t think so, unfortunately.
2Viliam
Found a nice example (linked from Zvi's article). Okay, it's just one example and it wasn't found randomly, but I am impressed.
2jacquesthibs
I've also started working on a repo in order to make Community Notes more efficient by using LLMs.
2ChristianKl
This sounds a bit naive.  There's a lot of energy invested in making it easier for powerful elites to push their preferred narratives. Community Notes are not in the interests of the Censorship Industrial Complex. I don't think that anyone at the project manager level has the political power to add a feature like Community Notes. It would likely need to be someone higher up in the food chain. 
9jacquesthibs
Sure, but sometimes it's just a PM and a couple of other people that lead to a feature being implemented. Also, keep in mind that Community Notes was a thing before Musk. Why was Twitter different than other social media websites? Also, the Community Notes code was apparently completely revamped by a few people working on the open-source code, which got it to a point where it was easy to implement, and everyone liked the feature because it noticeably worked. Either way, I'd rather push for making it happen and somehow it fails on other websites than having pessimism and not trying at all. If it needs someone higher up the chain, let's make it happen.
4ChristianKl
Twitter seems to have started Birdwatch as a small separate pilot project where it likely wasn't easy to fight or on anyone's radar to fight. In the current environment, where X gets seen as evil by a lot of the mainstream media, I would suspect that copying Community Notes from X would alone produce some resistance. The antibodies are now there in a way they weren't two years ago. If you look at mainstream media views about X's community notes, I don't think everyone likes it. I remember Elon once saying that he lost an 8-figure advertising deal because of Community Notes on posts of a company that wanted to advertise on X. I think you would likely need to make a case that it's good business in addition to helping with truth. If you want to make your argument via truth, motivating some reporters to write favorable articles about Community Notes might be necessary.
2jacquesthibs
Good points; I'll keep them all in mind. If money is the roadblock, we can put pressure on the companies to do this. Or, worst-case, maybe the government can enforce it (though that should be done with absolute care).
2jacquesthibs
I shared a tweet about it here: https://x.com/JacquesThibs/status/1724492016254341208?s=20 Consider liking and retweeting it if you think this is impactful. I'd like it to get into the hands of the right people.
1Bruce Lewis
I had not heard of Community Notes. Interesting anti-bias technique "notes require agreement between contributors who have sometimes disagreed in their past ratings". https://communitynotes.twitter.com/guide/en/about/introduction
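The quoted rule ("agreement between contributors who have sometimes disagreed") can be illustrated with a deliberately simplified toy. This is not the open-source Community Notes scorer, which, as I understand it, infers viewpoints from rating history rather than using fixed labels. Here the rater groups, the `min_per_group` and `threshold` parameters, and the function name are all hypothetical.

```python
from collections import defaultdict

def bridged_helpful(ratings, rater_group, min_per_group=2, threshold=0.6):
    """Return notes rated helpful by a clear majority within every rater group.

    ratings: iterable of (rater, note, helpful) with helpful a bool.
    rater_group: dict mapping rater -> group label (assumed known in this toy).
    """
    # note -> group -> [helpful_count, total_count]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for rater, note, helpful in ratings:
        tally = counts[note][rater_group[rater]]
        tally[0] += int(helpful)
        tally[1] += 1

    groups = set(rater_group.values())
    shown = set()
    for note, by_group in counts.items():
        # Every group must have enough ratings and a high enough helpful share.
        if all(
            by_group[g][1] >= min_per_group
            and by_group[g][0] / by_group[g][1] >= threshold
            for g in groups
        ):
            shown.add(note)
    return shown

rater_group = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
ratings = [
    ("a1", "note1", True), ("a2", "note1", True),
    ("b1", "note1", True), ("b2", "note1", True),
    ("a1", "note2", True), ("a2", "note2", True),
    ("b1", "note2", False), ("b2", "note2", False),
]
print(bridged_helpful(ratings, rater_group))
```

Only `note1` passes, because `note2` is popular with one group alone. That is the anti-bias property in miniature: raw vote counts are not enough.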
2jacquesthibs
I've been on Twitter for a long time, and there's pretty much unanimous agreement that it works amazingly well in practice!
1Kabir Kumar
there is an issue with surface level insights being unfairly weighted, but this is solvable, imo. especially with youtube, which can see which commenters have watched the full video.

I quickly wrote up some rough project ideas for ARENA and LASR participants, so I figured I'd share them here as well. I am happy to discuss these ideas and potentially collaborate on some of them.

Alignment Project Ideas (Oct 2, 2024)

1. Improving "A Multimodal Automated Interpretability Agent" (MAIA)

Overview

MAIA (Multimodal Automated Interpretability Agent) is a system designed to help users understand AI models by combining human-like experimentation flexibility with automated scalability. It answers user queries about AI system components by iteratively generating hypotheses, designing and running experiments, observing outcomes, and updating hypotheses.

MAIA uses a vision-language model (GPT-4V, at the time) backbone equipped with an API of interpretability experiment tools. This modular system can address both "macroscopic" questions (e.g., identifying systematic biases in model predictions) and "microscopic" questions (e.g., describing individual features) with simple query modifications.

This project aims to improve MAIA's ability to either answer macroscopic questions or microscopic questions on vision models.
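The hypothesize, experiment, observe, update loop described above can be sketched in a few lines. This is a toy under stated assumptions, not MAIA's code or API: the real system uses a vision-language model to propose hypotheses and write experiments, whereas here the candidate hypotheses are a fixed dictionary, and `run_agent` and `toy_unit` are hypothetical names.

```python
def run_agent(system, hypotheses, candidate_inputs, max_steps=10):
    """Eliminate hypotheses about `system` by probing inputs where they disagree."""
    remaining = dict(hypotheses)
    log = []
    for _ in range(max_steps):
        if len(remaining) <= 1:
            break
        # Design an experiment: find an input on which surviving hypotheses disagree.
        probe = next(
            (x for x in candidate_inputs
             if len({h(x) for h in remaining.values()}) > 1),
            None,
        )
        if probe is None:
            break  # no available experiment separates the remaining hypotheses
        outcome = system(probe)  # run the experiment and observe
        log.append((probe, outcome))
        # Update: keep only hypotheses consistent with the observation.
        remaining = {name: h for name, h in remaining.items() if h(probe) == outcome}
    return remaining, log

def toy_unit(x):
    """Stand-in for the model component under study (fires on even inputs)."""
    return x % 2 == 0

hypotheses = {
    "fires on even numbers": lambda x: x % 2 == 0,
    "fires on numbers above 3": lambda x: x > 3,
    "fires on multiples of 4": lambda x: x % 4 == 0,
}
surviving, log = run_agent(toy_unit, hypotheses, candidate_inputs=range(10))
print(surviving.keys(), log)
```

A project on improving MAIA would mostly live in the parts this toy hard-codes: how hypotheses are proposed and how experiments are chosen.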

2. Making "A Multimodal Automated Interpretability Agent" (MAIA) wor

... (read more)

My current speculation as to what is happening at OpenAI

How do we know this wasn't their best opportunity to strike if Sam was indeed not being totally honest with the board?

Let's say the rumours are true, that Sam is building out external orgs (NVIDIA competitor and iPhone-like competitor) to escape the power of the board and potentially going against the charter. Would this 'conflict of interest' be enough? If you take that story forward, it sounds more and more like he was setting up AGI to be run by external companies, using OpenAI as a fundraising bargaining chip, and having a significant financial interest in plugging AGI into those outside orgs.

So, if we think about this strategically, how long should they wait as board members who are trying to uphold the charter?

On top of this, it seems (according to Sam) that OpenAI has made a significant transformer-level breakthrough recently, which implies a significant capability jump. Long-term reasoning? Basically, anything short of 'coming up with novel insights in physics' is on the table, given that Sam recently used that line as the line we need to cross to get to AGI.

So, it could be a mix of, Ilya thinking they have achieved AG... (read more)

3jacquesthibs
Obviously, a lot has happened since the above shortform, but regarding model capabilities (which discussions died down these last couple of days), there's now this: Source: https://www.reuters.com/technology/sam-altmans-ouster-openai-was-precipitated-by-letter-board-about-ai-breakthrough-2023-11-22/ 
4jacquesthibs
So, apparently, there are two models, but only Q* is mentioned in the article. Won't share the source, but:
3jacquesthibs
Update, board members seem to be holding their ground more than expected in this tight situation:

News on the next OAI GPT release:

Nagasaki, CEO of OpenAI Japan, said, "The AI model called 'GPT Next' that will be released in the future will evolve nearly 100 times based on past performance. Unlike traditional software, AI technology grows exponentially."

https://www.itmedia.co.jp/aiplus/articles/2409/03/news165.html 

The slide clearly states 2024 "GPT Next". This 100 times increase probably does not refer to the scaling of computing resources, but rather to the effective computational volume + 2 OOMs, including improvements to the architecture and learning efficiency. GPT-4 NEXT, which will be released this year, is expected to be trained using a miniature version of Strawberry with roughly the same computational resources as GPT-4, with an effective computational load 100 times greater. Orion, which has been in the spotlight recently, was trained for several months on the equivalent of 100k H100 compared to GPT-4 (EDIT: original tweet said 10k H100s, but that was a mistake), adding 10 times the computational resource scale, making it +3 OOMs, and is expected to be released sometime next year.

Note: Another OAI employee seemingly confirms this (I've followed... (read more)

8Vladimir_Nesov
This implies successful use of FP8, if taken literally in a straightforward way. In BF16 an H100 gives 1e15 FLOP/s (in dense tensor compute). With 40% utilization over 10 months, 10K H100s give 1e26 FLOPs, which is only 5 times higher than the rumored 2e25 FLOPs of original GPT-4. To get to 10 times higher requires some 2x improvement, and the evident way to get that is by transitioning from BF16 to FP8. I think use of FP8 for training hasn't been confirmed to be feasible at GPT-4 level scale (Llama-3-405B uses BF16), but if it does work, that's a 2x compute increase for other models as well. This text about Orion and 10K H100s only appears in the bioshok3 tweet itself, not in the quoted news article, so it's unclear where the details come from. The "10 times the computational resource scale, making it +3 OOMs" hype within the same sentence also hurts credence in the numbers being accurate (10 times, 10K H100s, several months). Another implication is that Orion is not the 100K H100s training run (that's probably currently ongoing). Plausibly it's an experiment with training on a significant amount of synthetic data. This suggests that the first 100K H100s training run won't be experimenting with too much synthetic training data yet, at least in pre-training. The end of 2025 point for significant advancement in quality might then be referring to the possibility that Orion succeeds and its recipe is used in another 100K H100s scale run, which might be the first hypothetical model they intend to call "GPT-5". The first 100K H100s run by itself (released in ~early 2025) would then be called "GPT-4.5o" or something (especially if Orion does succeed, so that "GPT-5" remains on track).
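The arithmetic in the comment above is easy to re-run. A minimal sketch, assuming an average month of 30.44 days (my assumption) and taking the 1e15 FLOP/s BF16 peak, 40% utilization, and rumored 2e25 FLOPs for GPT-4 directly from the comment:

```python
SECONDS_PER_MONTH = 30.44 * 24 * 3600  # assumption: average calendar month

def training_flops(num_gpus, months, utilization, peak_flops_per_gpu=1e15):
    """Total training compute: GPUs x peak FLOP/s x utilization x wall-clock seconds."""
    return num_gpus * peak_flops_per_gpu * utilization * months * SECONDS_PER_MONTH

orion_10k = training_flops(num_gpus=10_000, months=10, utilization=0.40)
gpt4_rumored = 2e25  # rumored figure quoted in the comment, not a confirmed number
ratio = orion_10k / gpt4_rumored

print(f"10K H100s, 10 months, 40% utilization: {orion_10k:.2e} FLOPs")
print(f"Ratio to rumored GPT-4 compute: {ratio:.1f}x")
```

The ratio comes out near 5x rather than 10x, which is the gap the comment attributes to a possible BF16 to FP8 switch.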
8dirk
Bioshok3 said in a later tweet that they were in any case mistaken about it being 10k H100s and it was actually 100k H100s: https://x.com/bioshok3/status/1831016098462081256 
4Vladimir_Nesov
Surprisingly, there appears to be an additional clue for this in the wording: 2e26 BF16 FLOPs take 2.5 months on 100K H100s at 30% utilization, while the duration of "several months" is indicated by the text "数ヶ月" in the original tweet. GPT-4o explains it to mean So the interpretation that fits most is specifically 2-3 months (Claude says 2-4 months, Grok 3-4 months), close to what the calculation for 100K H100s predicts. And this is quite unlike the requisite 10 months with 10K H100s in FP8.
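The duration estimates can be written out as worked arithmetic. With total compute C, GPU count N, per-GPU peak F, and utilization u, the wall-clock time is t = C / (N F u). The FP8 line assumes a 2e15 FLOP/s peak (twice BF16) and the 40% utilization used in the earlier comment; both are my reading of that comment rather than confirmed figures.

```latex
\begin{align*}
t &= \frac{C}{N \cdot F \cdot u} \\[4pt]
t_{\text{100K, BF16}} &= \frac{2\times10^{26}}{10^{5}\cdot 10^{15}\cdot 0.3}
  \approx 6.7\times10^{6}\ \text{s} \approx 77\ \text{days} \approx 2.5\ \text{months} \\[4pt]
t_{\text{10K, FP8}} &= \frac{2\times10^{26}}{10^{4}\cdot 2\times10^{15}\cdot 0.4}
  = 2.5\times10^{7}\ \text{s} \approx 289\ \text{days} \approx 9.5\ \text{months}
\end{align*}
```

So "several months" fits the 100K reading and is far from the roughly ten months the 10K reading would need.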
2ryan_greenblatt
My guess is that this is just false / hallucinated.
4ryan_greenblatt
"Orion is 10x compute" seems plausible, "Orion was trained on only 10K H100s" does not seem plausible if it is actually supposed to be 10x raw compute. Around 50K H100s does seem plausible and would correspond to about 10x compute assuming a training duration similar to GPT-4.
4Vladimir_Nesov
Within this hypothetical, Orion didn't necessarily merit the use of the largest training cluster, while time on 10K H100s is something mere money can buy without impacting other plans. GPT-4o is itself plausibly at 1e26 FLOPs level already, since H100s were around for more than a year before it came out (1e26 FLOPs is 5 months on 20K H100s). It might be significantly overtrained, or its early fusion multimodal nature might balloon the cost of effective intelligence. Gemini 1.0 Ultra, presumably also an early fusion model with rumored 1e26 FLOPs, similarly wasn't much better than Mar 2023 GPT-4. Though Gemini 1.0 is plausibly dense, given how the Gemini 1.5 report stressed that 1.5 is MoE, so that might be a factor in how 1e26 FLOPs didn't get it too much of an advantage. So if GPT-4o is not far behind in terms of FLOPs, a 2e26 FLOPs Orion wouldn't be a significant improvement unless the synthetic data aspect works very well, and so there would be no particular reason to rush it. On the other hand GPT-4o looks like something that needed to be done as fast as possible, and so the largest training cluster went to it and not Orion. The scaling timelines are dictated by building of largest training clusters, not by decisions about use of smaller training clusters.
2ryan_greenblatt
This tweet also claims 10k H100s while citing the same article that doesn't mention this.
1LuigiPagani
Are you sure he is an OpenAI employee?

I encourage alignment/safety people to be open-minded about what François Chollet is saying in this podcast:

I think many are blindly bought into the 'scale is all you need' and apparently godly nature of LLMs and may be dependent on unfounded/confused assumptions because of it.

Getting this right is important because it could significantly impact how hard you think alignment will be. Here's @johnswentworth responding to @Eliezer Yudkowsky about his difference in optimism compared to @Quintin Pope (despite believing the natural abstraction hypothesis is true):

Entirely separately, I have concerns about the ability of ML-based technology to robustly point the AI in any builder-intended direction whatsoever, even if there exists some not-too-large adequate mapping from that intended direction onto the AI's internal ontology at training time.  My guess is that more of the disagreement lies here.

I doubt much disagreement between you and I lies there, because I do not expect ML-style training to robustly point an AI in any builder-intended direction. My hopes generally don't route through targeting via ML-style training.

I do think my deltas from many other people lie there - e.g. that

... (read more)
7Mitchell_Porter
In my opinion, this does not correspond to a principled distinction at the level of computation.  For intelligences that employ consciousness in order to do some of these things, there may be a difference in terms of mechanism. Reasoning and pattern matching sound like they correspond to different kinds of conscious activity.  But if we're just talking about computation... a syllogism can be implemented via pattern matching, a pattern can be completed by a logical process (possibly probabilistic). 
7jacquesthibs
Perhaps, but deep learning models are still failing at ARC. My guess (and Chollet's) is that they will continue to fail at ARC unless they are trained on that kind of data (which goes against the point of the benchmark) or you add something else that actually resolves this failure in deep learning models. It may be able to pattern-match to reasoning-like behaviour, but only if specifically trained on that kind of data. No matter how much you scale it up, it will still fail to generalize to anything not local in its training data distribution.
4Seth Herd
I think this is exactly right. The phrasing is a little confusing. I'd say "LLMs can't solve truly novel problems". But the implication that this is a slow route or dead-end for AGI is wrong. I think it's going to be pretty easy to scaffold LLMs into solving novel problems. I could be wrong, but don't bet heavily on it unless you happen to know way more about cognitive psychology and LLMs in combination than I do. It would be foolish to make a plan for survival that relies on this being a major delay. I can't convince you of this without describing exactly how I think this will be relatively straightforward, and I'm not ready to advance capabilities in this direction yet. I think language model agents are probably our best shot at alignment, so we should probably actively work on advancing them to AGI; but I'm not sure enough yet to start publishing my best theories on how to do that. Back to the possibly confusing phrasing Chollet uses: I think he's using Piaget's definition of intelligence as "what you do when you don't know what to do" (he quotes this in the interview). That's restricting it to solving problems you haven't memorized an approach to. That's not how most people use the word intelligence. When he says LLMs "just memorize", he's including memorizing programs or approaches to problems, and they can plug the variables of this particular variant of the problem into those memorized programs/approaches. I think the question "well maybe that's all you need to do" raised by Patel is appropriate; it's clear they can't do enough of this yet, but it's unclear if further progress will get them to another level of abstraction of an approach so abstract and general that it can solve almost any problem. I think he's on the wrong track with the "discrete program search" because I see more human-like solutions that may be lower-hanging fruit, but I wouldn't bet his approach won't work. I'm starting to think that there are many approaches to general intelligence
2quetzal_rainbow
Okay, hot take: I don't think that ARC tests "system 2 reasoning" and "solving novel tasks", at least in humans. When I see a simple task, I literally match patterns; when I see a complex task, I run whatever patterns I can invent until they match. I didn't run the entire ARC testing dataset, but if I am good at solving it, it will be because I am a fan of Myst-esque games and, actually, there are not so many possible principles in designing problems of this sort. What the failure of LLMs to solve ARC is actually telling us is that "LLM cognition is very different from human cognition".
3jacquesthibs
They've tested ARC with children and Mechanical Turk workers, and they all seem to do fine despite the average person not being a fan of "Myst-esque games." Do you believe LLMs are just a few OOMs away from solving novel tasks like ARC? What is different that is not explained by what Chollet is saying?
2quetzal_rainbow
By "good at solving" I mean "better than the average person". I think the fact that language models are better at predicting the next token than humans implies that LLMs have sophisticated text-oriented cognition, and saying "LLMs are not capable of solving ARC, therefore, they are less intelligent than children" is equivalent to saying "humans can't take the square root of 819381293787, therefore, they are less intelligent than a calculator". My guess is that we would probably need to do something non-trivial to scale LLMs to superintelligence, but I don't expect that it is necessary to move away from general LLM design principles.
4jacquesthibs
Of course, I acknowledge that LLMs are better at many tasks than children. Those tasks just happen to all be within its training data distribution and not on things that are outside of it. So, no, you wouldn't say the calculator is more intelligent than the child, but you might say that it has an internal program that allows it to be faster and more accurate than a child. LLMs have such programs they can use via pattern-matching too, as long as it falls into the training data distribution (in the case of Caesar cypher, apparently it doesn't do so well for number nine – because it's simply less common in its training data distribution). One thing that Chollet does mention that helps to alleviate the limitation of deep learning is to have some form of active inference:
2quetzal_rainbow
Let's start with the end: Why do you think that they don't already do that?  My point is that children can solve ARC not because they have some amazing abstract spherical-in-vacuum reasoning abilities which LLMs lack, but because they have human-specific pattern recognition ability (like geometric shapes, number sequences, music, etc). Brains have strong inductive biases, after all. If you train a model purely on the prediction of a non-anthropogenic physical environment, I think this model will struggle with solving ARC even if it has a sophisticated multi-level physical model of reality, because regular ARC-style repeating shapes are not very probable on priors. In my impression, in debates about ARC, AI people do not demonstrate a very high level of deliberation. Chollet and those who agree with him are like "nah, LLMs are nothing impressive, just interpolation databases!" and LLM enthusiasts are like "scaling will solve everything!!!!111!" Not many people seem to consider "something interesting is going on here. Maybe we can learn something important about how humans and LLMs work that doesn't fit into simple explanation templates."
1eggsyntax
  Since AFAIK in-context learning functions pretty similarly to fine-tuning (though I haven't looked into this much), it's not clear to me why Chollet sees online fine-tuning as deeply different from few-shot prompting. Certainly few-shot prompting works extremely well for many tasks; maybe it just empirically doesn't help much on this one?
1eggsyntax
  As per "Transformers learn in-context by gradient descent", which Gwern also mentions in the comment that @quetzal_rainbow links here.  
1Morpheus
Looking at how gpt-4 did on the benchmark when I gave it some screenshots, the thing it failed at was the visual "pattern matching" (things completely solved by my system 1) rather than the abstract reasoning.
2jacquesthibs
Yes, the point is that it can’t pattern match because it has never seen such examples. And, as humans, we are able to do well on the task because we don’t simply rely on pattern matching, we use system 2 reasoning (in addition) to do well on such a novel task. Given that the deep learning model relies on pattern matching, it can’t do the task.
3Morpheus
I think humans just have a better visual cortex and expect this benchmark too to just fall with scale.
2jacquesthibs
As Chollet says in the podcast, we will see if multimodal models crack ARC in the next year, but I think researchers should start paying attention, rather than dismissing the problem, if these models turn out to be incapable of doing so in the next year. But for now, “LLMs do fine with processing ARC-like data by simply fine-tuning an LLM on subsets of the task and then testing it on small variation.” It encodes solution programs just fine for tasks it has seen before. It doesn’t seem to be an issue of parsing the input or figuring out the program. For ARC, you need to synthesize a new solution program on the fly for each new task.
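To make "synthesize a new solution program on the fly" concrete, here is a toy sketch of brute-force search over a tiny, made-up DSL of grid transformations. It is not Chollet's method or any actual ARC solver; the primitives and the `synthesize` helper are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical mini-DSL: each primitive maps a grid (list of rows) to a new grid.
def identity(g):  return [row[:] for row in g]
def flip_h(g):    return [row[::-1] for row in g]        # mirror left-right
def flip_v(g):    return [row[:] for row in g[::-1]]     # mirror top-bottom
def transpose(g): return [list(col) for col in zip(*g)]

PRIMITIVES = {"identity": identity, "flip_h": flip_h,
              "flip_v": flip_v, "transpose": transpose}


def run(program, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def synthesize(examples, max_len=3):
    """Return the first shortest program consistent with every (input, output) pair."""
    for length in range(1, max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                return program
    return None


# One demonstration pair: the output is the input rotated 90 degrees clockwise.
examples = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]
program = synthesize(examples)

# Apply the discovered program to an unseen grid of a different shape.
test_output = run(program, [[5, 6, 7], [8, 9, 0]])
```

No single primitive explains the demonstration, so the search has to compose two of them, and the composed program then carries over to a grid it was never shown. Real ARC tasks need a far richer DSL and a much smarter search than exhaustive enumeration.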
1Morpheus
Would it change your mind if gpt-4 was able to do the grid tasks if I manually transcribed them to different tokens? I tried to manually let gpt-4 turn the image to a python array, but it indeed has trouble performing just that task alone.
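A minimal sketch of what "transcribing the grid to different tokens" could look like, assuming the grid is already available as a Python list of integer colour codes (ARC tasks are distributed as JSON grids of small integers, so no image parsing is needed). The function names and the letter palette are hypothetical.

```python
def grid_to_text(grid, palette="0123456789", sep=" "):
    """Serialize a grid of integer colour codes into one text line per row."""
    return "\n".join(sep.join(palette[cell] for cell in row) for row in grid)


def text_to_grid(text, palette="0123456789", sep=" "):
    """Inverse of grid_to_text: parse the text back into integer colour codes."""
    return [[palette.index(tok) for tok in line.split(sep)]
            for line in text.splitlines()]


grid = [[0, 0, 3],
        [0, 3, 0],
        [3, 0, 0]]

as_digits = grid_to_text(grid)                         # digit tokens
as_letters = grid_to_text(grid, palette="abcdefghij")  # same grid, different tokens
```

Feeding the model `as_digits` or `as_letters` instead of a screenshot would separate "can it read the grid" from "can it reason about the grid", which is the distinction being argued about here.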
1Morpheus
For concreteness. In this task it fails to recognize that all of the cells get filled, not only the largest one. To me that gives the impression that the image is just not getting compressed really well and the reasoning gpt-4 is doing is just fine.
1eggsyntax
There are other interesting places where LLMs fail badly at reasoning, eg planning problems like block-world or scheduling meetings between people with availability constraints; see eg this paper & other work from Kambhampati. I've been considering putting some time into this as a research direction; the ML community has a literature on the topic but it doesn't seem to have been discussed much in AIS, although the ARC prize could change that. I think it needs to be considered through a safety lens, since it has significant impacts on the plausibility of short timelines to drop-in-researcher like @leopold's. I have an initial sketch of such a direction here, combining lit review & experimentation. Feedback welcomed! (if in fact someone already has looked at this issue through an AIS lens, I'd love to know about it!)
1Morpheus
I don't get it. I just looked at ARC and it seemed obvious that gpt-4/gpt-4o can easily solve these problems by writing python. Then I looked it up on papers-with-code and it seems close to solved? Probably the ones remaining would be hard for children also. Did the benchmark leak into the training data and that is why they don't count them?
3dmz
Unfortunate name collision: you're looking at numbers on the AI2 Reasoning Challenge, not Chollet's Abstraction & Reasoning Corpus.
1Morpheus
Thanks for clarifying! I just tried a few simple ones by prompting gpt-4o and gpt-4, and they do an absolutely horrific job! Maybe actually good prompting could help solve it, but this is definitely already an update for me!

Attempt to explain why I think AI systems are not the same thing as a library card when it comes to bio-risk.

To focus on less of an extreme example, I’ll be ignoring the case where AI can create new, more powerful pathogens faster than we can create defences, though I think this is an important case (some people just don’t find it plausible because it relies on the assumption that AIs are able to create new knowledge).

I think AI Safety people should make more of an effort to walk through the threat model, so I’ll give it a quick first try:

1) Library. If I’m a terrorist and I want to build a bioweapon, I have to spend several months reading books at minimum to understand how it all works. I don’t have any experts on-hand to explain how to do it step-by-step. I have to figure out which books to read and in what sequence. I have to look up external sources to figure out where I can buy specific materials.

Then, I have to somehow find out how to gain access to those materials (this is the most difficult part for each case). Once I gain access to the materials, I still need to figure out how to make things work as a total noob at creating bioweapons. I will fail. Even experts fa... (read more)

Resharing a short blog post by an OpenAI employee giving his take on why we have 3-5 year AGI timelines (https://nonint.com/2024/06/03/general-intelligence-2024/):

Folks in the field of AI like to make predictions for AGI. I have thoughts, and I’ve always wanted to write them down. Let’s do that.

Since this isn’t something I’ve touched on in the past, I’ll start by doing my best to define what I mean by “general intelligence”: a generally intelligent entity is one that achieves a special synthesis of three things:

  • A way of interacting with and observing a complex environment. Typically this means embodiment: the ability to perceive and interact with the natural world.
  • A robust world model covering the environment. This is the mechanism which allows an entity to perform quick inference with a reasonable accuracy. World models in humans are generally referred to as “intuition”, “fast thinking” or “system 1 thinking”.
  • A mechanism for performing deep introspection on arbitrary topics. This is thought of in many different ways – it is “reasoning”, “slow thinking” or “system 2 thinking”.

If you have these three things, you can build a generally intelligent agent. Here’s how:

First, you se... (read more)

1Jonas Hallgren
I really like this take. I'm kind of "bullish" on active inference as a way to scale existing architectures to AGI as I think it is more optimised for creating an explicit planning system. Also, funnily enough, Yann LeCun has a paper on his beliefs on the path to AGI which I think Steve Byrnes has a good post on. It basically says that we need system 2 thinking in the way you said it here. With your argument in mind he kind of disproves himself to some extent. 😅
1kromem
I agree with a lot of those points, but suspect there may be fundamental limits to planning capabilities related to the unidirectionality of current feed forward networks. If we look at something even as simple as how a mouse learns to navigate a labyrinth, there's both a learning of the route to the reward but also a learning of how to get back to the start which adjusts according to the evolving learned layout of the former (see paper: https://elifesciences.org/articles/66175 ). I don't see the SotA models doing well at that kind of reverse planning, and expect that nonlinear tasks are going to pose significant agentic challenges until architectures shift to something new. So it could be 3-5 years to get to AGI depending on hardware and architecture advances, or we might just end up in a sort of weird "bit of both" world where we have models that are beyond expert human level superintelligent in specific scopes but below average in other tasks. But when we finally do get models that in both training and operation exhibit bidirectional generation across large context windows, I think it will only be a very short time until some rather unbelievable goalposts are passed by.

Low-hanging fruit:

Loving this Chrome extension so far: YouTube Summary with ChatGPT & Claude - Chrome Web Store

It adds a button on YouTube videos where, when you click it (or keyboard shortcut ctrl + x + x), it opens a new tab into the LLM chat of your choice, pastes the entire transcript in the chat along with a custom message you can add as a template ("Explain the key points.") and then automatically presses enter to get the chat going.

It's pretty easy to get a quick summary of a YouTube video without needing to watch the whole thing and then ask follow-up questions. It seems like an easy way to save time or do a quick survey of many YouTube videos. (I would not have bothered going through the entire "Team 2 | Lo fi Emulation @ Whole Brain Emulation Workshop 2024" talk, so it was nice to get the quick summary.)

I usually like getting a high-level overview of the key points of a talk to have a mental mind map skeleton before I dive into the details.

You can even set up follow-up prompt buttons (which works with ChatGPT but currently does not work with Claude for me), though I'm not sure what I'd use. Maybe something like, "Why is this important to AI alignment?"

The default prom... (read more)

5JaimeRV
I used to use that one but I moved to Sider: https://sider.ai/pricing?trigger=ext_chrome_btm_upgrd. It works on all pages, including YouTube. For papers and articles I have a shortcut that automatically modifies the URL (adding the prefix "https://r.jina.ai/") so you get the markdown, and then I run Sider on that. With gpt4o-mini it is almost free. Another nice thing about Sider is that you can write your own prompt templates.
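The URL shortcut described above amounts to a one-line string operation. A minimal sketch, using only the prefix quoted in the comment; the helper name is hypothetical and no network request is made here, so how the service responds is not shown.

```python
READER_PREFIX = "https://r.jina.ai/"  # prefix quoted in the comment above


def to_reader_url(url):
    """Prepend the reader prefix, unless the URL already carries it."""
    return url if url.startswith(READER_PREFIX) else READER_PREFIX + url


reader_url = to_reader_url("https://elifesciences.org/articles/66175")
print(reader_url)
```

The same function could sit behind a browser bookmarklet or keyboard shortcut; the idempotence check just avoids double-prefixing a page that was already converted.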
2jacquesthibs
Thanks for sharing, will give it a shot! Edit: Sider seems really great! I wish it could connect to Claude chat (without using credits), so I will probably just use both extensions.

Dario Amodei believes that LLMs/AIs can be aided to self-improve in a similar way to AlphaGo Zero (though LLMs/AIs will benefit from other things too, like scale), where the models can learn by themselves to gain significant capabilities.

The key for him is that Go has a set of rules that the AlphaGo model needs to abide by. These rules allow the model to become superhuman at Go with enough compute.

Dario essentially believes that to reach better capabilities, it will help to develop rules for all the domains we care about and that this will likely be possible for more real-world tasks (not just games like Go).

Therefore, I think the crux here is whether you think it is possible to develop rules for science (physics, chemistry, math, biology) and other domains such that the models can do this sort of self-play to become superhuman at each of the things we care about.

So far, we have examples like AlphaGeometry, which relies on our ability to generate many synthetic examples to help the model learn. This makes sense for the geometry use case, but how do we know if this kind of approach will work for the kinds of things we actually care about? For games and geometry, this seems possible, but wha... (read more)

4jacquesthibs
Hey @Zac Hatfield-Dodds, I noticed you are looking for citations; these are the interview bits I came across (and here at 47:31). It's possible I misunderstood him; please correct me if I did!
3Zac Hatfield-Dodds
I don't think any of these amount to a claim that "to reach ASI, we simply need to develop rules for all the domains we care about". Yes, AlphaGo Zero reached superhuman levels on the narrow task of playing Go, and that's a nice demonstration that synthetic data could be useful, but it's not about ASI and there's no claim that this would be either necessary or sufficient. (not going to speculate on object-level details though)
2jacquesthibs
Ok, totally; there's no specific claim about ASI. Will edit the wording.
4Seth Herd
I think this type of autonomous learning is fairly likely to be achieved soon (1-2 years), and it doesn't need to exactly follow AlphaZero's self-play model. The world has rules. Those rules are much more complex and stochastic than games or protein folding. But note that the feedback in Go comes only after something like 200 moves, yet the powerful critic head is able to use that to derive a good local estimate of what's likely a good or bad move. Humans use a similar powerful critic in the dopamine system working in concert with the cortex's rich world model to decide what's rewarding long before there's a physical reward or punishment signal. This is one route to autonomous learning for LLM agents. I don't know if Amodei is focused on base models or hybrid learning systems, and that matters. Or maybe it doesn't. I can think of more human-like ways of autonomous learning in a hybrid system, but a powerful critic may be adequate for self-play even in a base model. Existing RLHF techniques do use a critic - I think it's proximal policy optimization (or DPO?) in the last OpenAI setup they publicly reported. (I haven't looked at Anthropic's RLAIF setup to see if they're using a similar critic portion of the model; I'd guess they are, following OpenAI's success with it). I'd expect they're experimenting with using small sets of human feedback to leverage self-critique as in RLAIF, making a better critic that makes a better overall model. Decomposing video into text and then predicting how people behave both physically and emotionally offer two new windows onto the rules of the world. I guess those aren't quite in the self-play domain on their own, but having good predictions of outcomes might allow autonomous learning of agentic actions by taking feedback not from a real or simulated world, but from that trained predictor of physical and social outcomes. Deriving a feedback signal directly from the world can be done in many ways. I expect there are more clever ide
2Seth Herd
Glancing back at this, I noted I missed the most obvious form of self-play: putting an agent in an interaction with another copy of itself. You could do any sort of "scoring" by having an automated evaluation of the outcome vs. the current goal. This has some obvious downsides, in that the agents aren't the same as people. But it might get you a good bit of extra training that predicting static datasets doesn't give. A little interaction with real humans might be the cherry on top of the self-play whipped cream on the predictive learning sundae.
3davekasten
I am fairly skeptical that we don't already have something close-enough-to-approximate this if we had access to all the private email logs of the relevant institutions matched to some sort of correlation of "when this led to an outcome" metric (e.g., when was the relevant preprint paper or strategy deck or whatever released)
2RogerDearnaley
Go has rules, and gives you direct and definitive feedback on how well you're doing, but, while a very large space, it isn't open-ended. A lot of the foundation model companies appear to be busily thinking about doing something AlphaZero-inspired in mathematics, which also has rules, and can be arranged to give you direct feedback on how you're doing (there have been recent papers on how to make this more efficient with less human input). The same goes for writing and debugging software. Indeed, models have recently been getting better at math and coding faster than at other topics, suggesting that they're making real progress. When I watched that Dario interview (the Scandinavian bank one, I assume) my assumption was that Dario was talking about those, but using AlphaGo as a clearer and more widely-familiar example. Expanding this to other areas seems like it would come next: robotics seems a promising one that also gives you a lot of rapid feedback, science would be fascinating and exciting but the feedback loops are a lot longer, human interactions (on something like the Character AI platform) seem like another possibility (though the result of that might be models better at human manipulation and/or pillow-talk, which might not be entirely a good thing).

Alignment Researcher Assistant update.

Hey everyone, my name is Jacques, I'm an independent technical alignment researcher, primarily focused on evaluations, interpretability, and scalable oversight (more on my alignment research soon!). I'm now focusing more of my attention on building an Alignment Research Assistant (I've been focusing on my alignment research for 95% of my time in the past year). I'm looking for people who would like to contribute to the project. This project will be private unless I say otherwise (though I'm listing some tasks); I understand the dual-use nature and most criticism against this kind of work.

How you can help:

  • Provide feedback on what features you think would be amazing in your workflow to produce high-quality research more efficiently.
  • Volunteer as a beta-tester for the assistant.
  • Contribute to one of the tasks below. (Send me a DM, and I'll give you access to the private Discord to work on the project.)
  • Provide funding to hire full-time developers to build the features.

Here's the vision for this project:

How might we build an AI system that augments researchers to get us 5x or 10x productivity for the field as a whole?

The system is designed with two main minds

... (read more)
3jacquesthibs
We're doing a hackathon with Apart Research on the 26th. I created a list of problem statements for people to brainstorm off of.
Pro-active insight extraction from new research
Reading papers can take a long time and is often not worthwhile. As a result, researchers might read too many papers or almost none. However, there are still valuable nuggets in papers and posts. The issue is finding them. So, how might we design an AI research assistant that proactively looks at new papers (and old) and shares valuable information with researchers in a naturally consumable way? Part of this work involves presenting individual researchers with what they would personally find valuable and not overwhelming them with things they are less interested in.
How can we improve the LLM experience for researchers?
Many alignment researchers will use language models much less than they would like to because they don't know how to prompt the models, it takes time to create a valuable prompt, the model doesn't have enough context for their project, the model is not up-to-date on the latest techniques, etc. How might we make LLMs more useful for researchers by relieving them of those bottlenecks?
Simple experiments can be done quickly, but turning them into a full project can take a lot of time
One key bottleneck for alignment research is transitioning from an initial 24-hour simple experiment in a notebook to a set of complete experiments tested with different models, datasets, interventions, etc. How can we help researchers move through that second research phase much faster?
How might we use AI agents to automate alignment research?
As AI agents become more capable, we can use them to automate parts of alignment research. The paper "A Multimodal Automated Interpretability Agent" serves as an initial attempt at this. How might we use AI agents to help either speed up alignment research or unlock paths that were previously inaccessible?
How can we nudge research toward better objectives (age
2jacquesthibs
This just got some massive downvotes. Would like to know why. My guess is "This can be dual-use. Therefore, it's bad," but if not, it would be nice to know.

I recently sent in some grant proposals to continue working on my independent alignment research. It gives an overview of what I'd like to work on for this next year (and more really). If you want to have a look at the full doc, send me a DM. If you'd like to help out through funding or contributing to the projects, please let me know.

Here's the summary introduction:

12-month salary for building a language model system for accelerating alignment research and upskilling (additional funding will be used to create an organization), and studying how to supervise AIs that are improving AIs to ensure stable alignment.

Summary

  • Agenda 1: Build an Alignment Research Assistant using a suite of LLMs managing various parts of the research process. Aims to 10-100x productivity in AI alignment research. Could use additional funding to hire an engineer and builder, which could evolve into an AI Safety organization focused on this agenda. Recent talk giving a partial overview of the agenda.
  • Agenda 2: Supervising AIs Improving AIs (through self-training or training other AIs). Publish a paper and create an automated pipeline for discovering noteworthy changes in
... (read more)
1mesaoptimizer
Can you give concrete use-cases that you imagine your project would lead to helping alignment researchers? Alignment researchers have wildly varying styles of work outputs and processes. I assume you aim to accelerate a specific subset of alignment researchers (those focusing on interpretability and existing models and have an incremental / empirical strategy for solving the alignment problem).

Recent paper I thought was cool:

In-Run Data Shapley: Data attribution method efficient enough for pre-training data attribution.

Essentially, it can track how individual data points (or clusters) impact model performance across pre-training. You just need to develop a set of validation examples and continually check the model's performance on those examples during pre-training. Amazingly, you can do this over the course of a single training run; there is no need for multiple pre-training runs, as other data attribution methods require.

Other methods, like influence functions, are too computationally expensive to run during pre-training and can only be run post-training.
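To give a feel for tracking attribution within a single run, here is a toy sketch on a one-parameter linear model. It accumulates, at every SGD step, a first-order gradient-dot-product estimate of how much each training point's share of the update reduces validation loss. I believe the paper's first-order variant has this general form, but treat that as an assumption: this is not the authors' implementation, it ignores their higher-order terms and efficiency tricks, and all names and data below are made up for illustration.

```python
# Toy model: y ≈ w * x with squared loss 0.5 * (w*x - y)**2.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (1.0, -2.0)]  # last point is mislabeled
val = [(1.5, 3.0)]                                          # clean validation example


def grad(w, x, y):
    """Derivative of 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x


def train_with_attribution(train, val, lr=0.1, steps=20):
    """Full-batch gradient descent that also accumulates per-example scores."""
    w = 0.0
    scores = [0.0] * len(train)
    for _ in range(steps):
        g_val = sum(grad(w, x, y) for x, y in val) / len(val)
        g_train = [grad(w, x, y) for x, y in train]
        for i, g_i in enumerate(g_train):
            # First-order estimate of the validation-loss reduction caused by
            # example i's share of this update (positive means helpful).
            scores[i] += (lr / len(train)) * g_val * g_i
        w -= lr * sum(g_train) / len(train)
    return w, scores


w, scores = train_with_attribution(train, val)
for (x, y), s in zip(train, scores):
    print(f"x={x}, y={y}: score {s:+.3f}")
```

In this toy run the three clean points accumulate positive scores and the mislabeled point a negative one, all from one pass of training. Swapping the validation set for examples of a specific behaviour is the kind of use discussed in the bullets that follow.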

So, here's why this might be interesting from an alignment perspective:

  • You might be able to set up a bunch of validation examples to test specific behaviour in the models so that we are hyper-aware of which data points contribute the most to that behaviour. For example, self-awareness or self-preservation.
  • Given that this is possible to run during pre-training, you might understand model behaviour at such a granular level that you can construct data mixtures/curriculums that push the model towards internalizing 'hum
... (read more)
7jacquesthibs
I sent some related project ideas to @RogerDearnaley via DMs, but figured I should share them here too in case someone would like to give feedback or would like to collaborate on one of them.
----------------------------------------
I think data is underrated among the alignment community (synthetic/transformed data even more). I have been thinking about it from the perspective of pre-training and post-training. My initial look into synthetic data was related to online learning and essentially controlling model behaviour. I was interested in papers like this one by Google, where they significantly reduce sycophancy in an LLM via 1k synthetically generated examples. Data shapes behaviour, and I think many people do not acknowledge this enough (which sometimes leads them to make confused conclusions about model behaviour).
In terms of specific research projects, my current ideas fall into these kinds of buckets:
Pre-training close to the basin of attraction for alignment
  • How much can we improve "Pretraining Language Models with Human Preferences"? I'd like to transform training in various ways (as mentioned in your posts). For example, I could take fineweb and pre-train a GPT-2 sized model with the original dataset and a transformed version. Unclear so far which things I'd like to measure the most at that model size, though. A downstream experiment: is one model more likely to reward hack over the other? Does shard theory help us come up with useful experiments (pre-training with human feedback is almost like reinforcing behaviour and leveraging some form of shard theory)? Note that Google used a similar pre-training scheme for PaLM 2.
  • How can the "basin of attraction for alignment" be mathematically formalized?
  • Trying to understand the impact of systematic errors:
  • Studying reward misspecification: do the reward labels have a systematic effect and bias in pushing the model? How much of the model's behaviour is determined by the data itself vs. the reward
7RogerDearnaley
I love this idea! Thanks for suggesting it. (It is, of course, not a Bitter Lesson approach, but may well still be a great idea.) Another area where being able to do this efficiently at scale is going to be really important is once models start showing dangerous levels of capability on WMD-dangerous chem/bio/radiological/nuclear (CBRN) and self-replication skills. The best way to make sure these skills aren't in the model at all, so the model can't be fine-tuned back to these capabilities (as is required to produce a model of this level where one could at least discuss open-sourcing it, rather than that being just flagrantly crazy and arguably perhaps already illegal), is to omit key knowledge from the training set entirely. That inevitably isn't going to succeed on the first pass, but this technique applied to the first pass gives us a way to find (hopefully) everything we need to remove from the training set so we can do a second training run that has specific, focused, narrow gaps in its capabilities. And yes, I'm interested in work in this area (my AI day-job allowing).
7Mike Vaiana
Hey, we've been brainstorming ideas about better training strategies for base models and what types of experiments we can run at a small scale (e.g. training GPT-2) to get initial information. I think this idea is really promising and would love to chat about it.
4jacquesthibs
It's cool that you point to @Tomek Korbak because I was wondering if we could think of ways to extend his Pretraining Language Models with Human Preferences paper in ways that Roger mentions in his post. Happy to chat!
1Bogdan Ionut Cirstea
This might be relatively straightforward to operationalize using (subsets of) the dataset from Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs. Another related idea (besides / on top of e.g. delaying the learning of dangerous capabilities / prerequisites to scheming) could be to incentivize them to e.g. be retrieved in-context, rather than be learned in-weights (to the degree they're important for performance), for (differential) transparency reasons. Also, similarly to recent unlearning papers, it might be useful to also have a validation dataset as a proxy for which capabilities should be preserved; and potentially try (cheap) synthetic data to compensate for any capabilities losses on that one.
2jacquesthibs
Yeah, I was thinking about using SAD. The main issue is that for non-AGI-lab-sized models, you'll have a tough time eliciting SA. However, we could potentially focus on precursor capabilities and such. If you are concerned about capabilities like SA, then you might ask yourself, "it seems highly unlikely that you can figure out which data points impact SA the most because it will likely be a mix of many things and each data point will play a role in accumulating to SA." My guess is that you can break down SA into enough precursor capabilities that this approach can still be highly predictive even if it isn't 100%. I think forcing them to retrieve in-context sounds good, but I also think labs may not want this, not sure. Basically, they'll want to train things into the model eventually, like for many CoT things. Agreed on having a validation set for reducing the alignment tax.
1Bogdan Ionut Cirstea
Here's Claude-3.5 (though I had to push it a bit in the direction of explicitly considering combining SAD and Data Shapley): 'Combining the Situational Awareness Dataset (SAD) benchmark with Shapley values, particularly the In-Run Data Shapley approach described in the other paper, could yield some interesting insights. Here are some potential ways to integrate these two approaches: 1. Attribute situational awareness to training data: Use In-Run Data Shapley to determine which training data contributes most to performance on SAD tasks. This could help identify what types of data are most important for developing situational awareness in AI models. 2. Analyze task-specific contributions: Calculate Shapley values for each category or individual task within SAD. This could reveal which parts of the training data are most influential for different aspects of situational awareness. 3. Track situational awareness development: Apply In-Run Data Shapley at different stages of training to see how the importance of different data points for situational awareness changes over time. 4. Identify potential deception enablers: Look for training data with high Shapley values for both SAD performance and other capabilities that might enable deception. This could help pinpoint data that contributes to potentially risky combinations of abilities. 5. Curate training data: Use the Shapley values to guide the curation of training datasets, potentially removing or de-emphasizing data that contributes disproportionately to unwanted levels of situational awareness. 6. Comparative analysis across models: Compare Shapley values for SAD performance across different model architectures or training regimes to understand how different approaches affect the development of situational awareness. 7. Investigate prompt influence: Apply In-Run Data Shapley to analyze how much the "situating prompt" contributes to SAD performance compared to other parts of the input. 8. Correlation studi
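
For readers who haven't seen data attribution with Shapley values, here is a minimal sketch of the basic idea behind point 1. It uses plain Monte Carlo permutation sampling, not the In-Run Data Shapley method named above, and the utility function is a made-up stand-in for "SAD score of a model trained on this subset". The names `monte_carlo_data_shapley`, `toy_sad_utility`, and the numbers in `CONTRIBUTION` are all hypothetical.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical per-example effect on a SAD-style score (illustrative numbers only).
CONTRIBUTION = [0.0, 0.5, 0.1, 0.4]


def toy_sad_utility(subset: Sequence[int]) -> float:
    """Stand-in for 'train on subset, then evaluate on SAD'.

    Assumption: the score is additive, plus a small bonus when examples
    1 and 3 are both present (a toy interaction between data points).
    """
    bonus = 0.2 if (1 in subset and 3 in subset) else 0.0
    return sum(CONTRIBUTION[i] for i in subset) + bonus


def monte_carlo_data_shapley(
    n_points: int,
    utility: Callable[[Sequence[int]], float],
    n_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate each data point's Shapley value by averaging its marginal
    contribution over random orderings of the dataset."""
    rng = random.Random(seed)
    values = [0.0] * n_points
    for _ in range(n_permutations):
        order = list(range(n_points))
        rng.shuffle(order)
        subset: List[int] = []
        previous = utility(subset)
        for idx in order:
            subset.append(idx)
            current = utility(subset)
            values[idx] += current - previous  # marginal gain from adding idx
            previous = current
    return [v / n_permutations for v in values]
```

Calling `monte_carlo_data_shapley(4, toy_sad_utility)` returns one estimate per example. In a real experiment each utility call would require training (or approximating the training of) a model, which is exactly the cost that in-run approaches try to avoid.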

I'm currently ruminating on the idea of doing a video series in which I review code repositories that are highly relevant to alignment research to make them more accessible.

I do want to pick out repos with perhaps even bad documentation that are still useful and then hop on a call with the author to go over the repo and record it. At least have something basic to use when navigating the repo.

This means there would be two levels: 1) an overview with the author sharing at least the basics, and 2) a deep dive going over most of the code. The former likely contains most of the value (lower effort for me, still gets done, better than nothing, points to repo as a selection mechanism, people can at least get started).

I am thinking of doing this because I think there may be repositories that are highly useful for new people but would benefit from some direction. For example, I think Karpathy and Neel Nanda's videos have been useful in getting people started. In particular, Karpathy saw OOM more stars on his repos (e.g. nanoGPT) after the release of his videos (though, to be fair, he's famous, and the number of stars is definitely not a perfect proxy for usage).

I'm interested in any feedback ... (read more)

6Dagon
I love this idea!  I don't actually like videos, preferring searchable, excerptable text, but I may not be typical and there's room for all. At first glance, I agree with your guess that the overview/intro is more value per effort (for you and for consumers, IMO) than a deep-dive into the code. There IS probably a section of code or core modeling idea for each where it would be worth going half-deep into (algorithm and usage, not necessarily line-by-line). Note that this list is itself incredibly valuable, and you might start with an intro video (and associated text) that spends 1 minute on each and why you're planning to do it, and what you currently think will be the most important intro concept(s) for each.

Current Thoughts on my Learning System

Crossposted from my website. Hoping to provide updates on my learning system every month or so.

TLDR of what I've been thinking about lately:

  • There are some great insights in this video called "How Top 0.1% Students Think." And in this video about how to learn hard concepts.
  • Learning is a set of skills. You need to practice each component of the learning process to get better. You can’t watch a video on a new technique and immediately become a pro. It takes time to reap the benefits.
  • Most people suck at mindmaps. Mindmaps can be horrible for learning if you just dump a bunch of text on a page and point arrows to different stuff (some studies show mindmaps are ineffective, but that's because people initially suck at making them). However, if you take the time to learn how to do them well, they will pay huge dividends in the future. I’ll be doing the “Do 100 Things” challenge and developing my skill in building better mindmaps. Getting better at mindmaps involves “chunking” the material and creating memorable connections and drawings.
  • Relational vs Isolated Learning. As you learn something new, try to learn it in relation to the things you already kno
... (read more)
3jacquesthibs
Note on using ChatGPT for learning * Important part: Use GPT to facilitate the process of pushing you to higher-order learning as fast as possible. * Here’s Bloom’s Taxonomy for higher-order learning: * * For example, you want to ask GPT to come up with analogies and such to help you enter higher-order thinking by thinking about whether the analogy makes sense. * Is the analogy truly accurate? * Does it cover the main concept you are trying to understand? * Then, you can extend the analogy to try to make it better and more comprehensive. * This allows you to offload the less useful task (e.g. coming up with the analogy), and spend more time in the highest orders of learning (the evaluation phase; “is this analogy good? where does it break down?”). * You still need to use your cognitive load to encode the knowledge effectively. Look for desirable difficulty. * Use GPT to create a pre-study of the thing you would like to learn. * Have it create an outline of the order of the things you should learn. * Have it give you a list of all the jargon words in a field and how they relate so that you can quickly get up to speed on the terminology and talk to an expert. * Coming up with chunks of the topic you are exploring. * You can give GPT text that describes what you are trying to understand, the relationships between things and how you are chunking them. * Then, you can ask GPT to tell you what are some weak areas or some things that are potentially missing. * GPT works really well as a knowledge “gap-checker”. When you are trying to have GPT output some novel insights or complicated nuanced knowledge, it can give vague answers that aren’t too helpful. This is why it is often better to treat GPT as a gap-checker and/or a friend that is prompting you to come up with great insights. Reference: I’ve been using ChatGPT/GPT-4 a lot to gain insights on how to accelerate alignment research. Some of my conclusions are similar to what was d
2jacquesthibs
How learning efficiently applies to alignment research As we are trying to optimize for actually solving the problem, we should not fall into the trap of learning just to learn. We should instead focus on learning efficiently with respect to how it helps us generate insights that lead to a solution for alignment. This is also the framing we should have in mind when we are building tools for augmenting alignment researchers. With the above in mind, I expect that part of the value of learning efficiently involves some of the following: * Efficient learning involves being hyper-focused on identifying the core concepts and how they all relate to one another. This mode of approaching things seems like it helps us attack the core of alignment much more directly and bypasses months/years of working on things that are only tangential. * Developing a foundation of a field seems key to generating useful insights. The goal is not to learn everything but to build a foundation that allows you to bypass spending way too much time tackling sub-optimal sub-problems or dead-ends. Part of the foundation-building process should reduce the time it takes to shape you into an exceptional alignment researcher rather than a knower-of-things. * As John Wentworth says with respect to the Game Tree of Alignment: "The main reason for this exercise is that (according to me) most newcomers to alignment waste years on tackling not-very-high-value sub-problems or dead-end strategies." * Lastly, many great innovations have not come from unique original ideas. There's an iterative process passed amongst researchers and it seems often the case that the greatest ideas come from simply merging ideas that were already lying around. Learning efficiently (and storing those learnings for later use) allows you to increase the number of ideas you can merge together. If you want to do that efficiently, you need to improve your ability to identify which ideas are worth storing in your mental wa
2Peter Hroššo
My model of (my) learning is that if the goal is sufficiently far, learning directly towards the goal is goodharting a likely wrong metric. The only method which worked for me for very distant goals is following my curiosity and continuously internalizing new info, such that the curiosity is well informed about current state and the goal.
2jacquesthibs
Curiosity is certainly a powerful tool for learning! I think any learning system which isn't taking advantage of it is sub-optimal. Learning should be guided by curiosity. The thing is, sometimes we need to learn things we aren't so curious about. One insight I learned from studying learning is that you can do specific things to make yourself more curious about a given thing and harness the power that comes with curiosity. Ultimately, what this looks like is writing down questions about the topic and using them to guide your curious learning process. It seems that this is how efficient top students end up learning things deeply in a shorter amount of time. Even for material they care little about, they are able to make themselves curious and be propelled forward by that. That said, my guess is that goodharting the wrong metric can definitely be an issue, but I'm not convinced that relying on what makes you naturally curious is the optimal strategy for solving alignment. Either way, it's something to think about!
1jacquesthibs
By the way, I've just added a link to a video by a top competitive programmer on how to learn hard concepts. In the video and in the iCanStudy course, both talk about the concept of caring about what you are learning (basically, curiosity). Gaining the skill to care and become curious is an essential part of the most effective learning. However, contrary to popular belief, you don't have to be completely guided by what makes you naturally curious! You can learn how to become curious (or care) about any random concept.
1jacquesthibs
Video on how to approach having to read a massive amount of information (like a textbook) as efficiently as possible: 
1jacquesthibs
Added my first post (of, potentially, a sequence) on effective learning here. I think there are a lot of great lessons at the frontier of the literature and real-world practice on learning that go far beyond the Anki approach that a lot of people seem to take these days. The important part is being effective and efficient. Some techniques might work, but that does not mean they are the most efficient (learning the same thing more deeply in less time). Note that I also added two important videos to the root shortform:
1jacquesthibs
Note on spaced repetition While spaced repetition is good, many people end up misusing it as a crutch instead of defaulting to trying to deeply understand a concept right away. As you get better at properly encoding the concept, you extend the forgetting curve to the point where repetition is less needed. Here's a video of a top-level programmer on how he approaches learning hard concepts efficiently. And here's a video on how the top 0.1% of students study efficiently.
1jacquesthibs
Here are some additional notes on the fundamentals of being an effective learner: Encoding and Retrieval (What it takes to learn) * Working memory is the memory that we use. However, if it is not encoded properly or at all, we will forget it. * Encode well first (from working memory to long-term memory), then frequently and efficiently retrieve from long-term memory. * If studying feels easy, it means that you aren't learning or holding on to the information. It means that you are not encoding and retrieving effectively. * You want it to be difficult when you are studying because this is how it will encode properly. Spacing, Interleaving, and Retrieval (SIR) * These are three rules that apply to every study technique in the course (unless told otherwise). You can apply SIR to all techniques. * Spacing: space your learning out. * Pre-study before class, then learn in class, and then a week later revise it with a different technique. * A rule of thumb you can follow is to wait long enough until you feel like you are just starting to forget the material. * As you get better at encoding the material effectively as soon as you are exposed to it, you will notice that you will need to do less repetition. * How to space reviews: * Beginner Schedule (fewer reviews needed as you get better at encoding) * Same day * Next day * End of week * End of month * After learning something for the first time, review it later on the same day. * Review everything from the last 2-3 days mid-week. * Do an end of week revision on the week's worth of content. * End of month revision on entire month's worth of content. * Review of what's necessary as time goes on. * (If you're trying to do well on an exam or a coding interview, you can do the review 1 or 2 weeks before the assessment.) * Reviewing time duration: * For beginners * No less than 30 minutes per subject for end-of-week * No less th
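
As a rough illustration of the beginner spacing schedule in the notes above (same day, next day, end of week, end of month), here is a small sketch that turns a learning date into review dates. The function name `beginner_review_dates` is hypothetical, and treating Sunday as the end of the week is my assumption, not something the course specifies.

```python
import calendar
from datetime import date, timedelta
from typing import Dict


def beginner_review_dates(learned_on: date) -> Dict[str, date]:
    """Return the four review dates of the beginner schedule.

    Assumption: the week ends on Sunday and the month ends on its last
    calendar day; adjust both to match your own study rhythm.
    """
    # date.weekday() is 0 for Monday and 6 for Sunday.
    end_of_week = learned_on + timedelta(days=6 - learned_on.weekday())
    last_day = calendar.monthrange(learned_on.year, learned_on.month)[1]
    end_of_month = learned_on.replace(day=last_day)
    return {
        "same day": learned_on,
        "next day": learned_on + timedelta(days=1),
        "end of week": end_of_week,
        "end of month": end_of_month,
    }
```

For example, `beginner_review_dates(date(2024, 1, 10))` schedules reviews for 10, 11, 14, and 31 January 2024. As encoding improves, the notes suggest dropping reviews rather than adding them, so this is a starting point, not a fixed rule.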
1jacquesthibs
A few more notes: * I use the app Concepts on my iPad to draw mindmaps. Drawing mindmaps with pictures and such is way more powerful (better encoding into long-term memory) than typical mindmap apps where you just type words verbatim and draw arrows. It's excellent since it has a (quasi-) infinite canvas. This is the same app that Justin Sung uses. * When I want to go in-depth into a paper, I will load it into OneNote on my iPad and draw in the margin to better encode my understanding of the paper. * I've been using the Voice Dream Reader app on my iPhone and iPad to get through posts and papers much faster (I usually have time to read most of an Alignment Forum post on my way to work and another on the way back). Importantly, I stop the text-to-speech when I'm trying to understand an important part. I use Pocket to load LW/AF posts into it and download PDFs on my device and into the app for reading papers. There's a nice feature in the app that automatically skips citations in the text, so reading papers isn't as annoying. The voices are robotic, but I just cycled through a bunch until I found one I didn't mind (I didn't buy any, but there are premium voices). I expect Speechify has better voices, but it's more expensive, and I think people find that the app isn't as good overall compared to Voice Dream Reader. Thanks to Quintin Pope for recommending the app to me.

I’m still thinking this through, but I am deeply concerned about Eliezer’s new article for a combination of reasons:

  • I don’t think it will work.
  • Given that it won’t work, I expect we lose credibility and it now becomes much harder to work with people who were sympathetic to alignment, but still wanted to use AI to improve the world.
  • I am not as convinced as he is about doom and I am not as cynical about the main orgs as he is.

In the end, I expect this will just alienate people. And stuff like this concerns me.

I think it’s possible that the most memetically powerful approach will be to accelerate alignment rather than suggesting long-term bans or effectively antagonizing all AI use.

So I think what I'm getting here is that you have an object-level disagreement (not as convinced about doom), but you are also reinforcing that object-level disagreement with signalling/reputational considerations (this will just alienate people). This pattern feels ugh and worries me. It seems highly important to separate the question of what's true from the reputational question. It furthermore seems highly important to separate arguments about what makes sense to say publicly on-your-world-model vs on-Eliezer's-model. In particular, it is unclear to me whether your position is "it is dangerously wrong to speak the truth about AI risk" vs "Eliezer's position is dangerously wrong" (or perhaps both). 

I guess that your disagreement with Eliezer is large but not that large (IE you would name it as a disagreement between reasonable people, not insanity). It is of course possible to consistently maintain that (1) Eliezer's view is reasonable, (2) on Eliezer's view, it is strategically acceptable to speak out, and (3) it is not in fact strategically acceptable for people with Eliezer's views to speak out about those views. But this combination of views does imply endorsing a silencing of reasonable disagreements which seems unfortunate and anti-epistemic. 

My own guess is that the maintenance of such anti-epistemic silences is itself an important factor contributing to doom. But, this could be incorrect.

2jacquesthibs
Yeah, so just to clarify a few things: * This was posted on the day of the open letter and I was indeed confused about what to think of the situation. * I think something I failed to properly communicate is that I was worried that this was a bad time to pull the lever even if I’m concerned about risks from AGI. I was worried the public wouldn’t take alignment seriously because it would cause a panic much sooner than people were ready for. * I care about being truthful, but I care even more about not dying, so my comment was mostly trying to communicate that I didn’t think this was the best strategic decision for not dying. * I was seeing a lot of people write negative statements about the open letter on Twitter and it kind of fed my fears that this was going to backfire as a strategy and impact all of our work to get AI risk taken seriously. * In the end, the final thing that matters is that we win (i.e. not dying from AGI). I’m not fully sure what I think now (mostly because I don’t know about higher order effects that will happen 2-3 years from now), but I think it turned out a lot better strategically than I initially expected.
8jacquesthibs
To try and burst any bubble about people’s reaction to the article, here’s a set of tweets critical about the article: * https://twitter.com/mattparlmer/status/1641230149663203330?s=61&t=ryK3X96D_TkGJtvu2rm0uw  * https://twitter.com/jachiam0/status/1641271197316055041?s=61&t=ryK3X96D_TkGJtvu2rm0uw  * https://twitter.com/finbarrtimbers/status/1641266526014803968?s=61&t=ryK3X96D_TkGJtvu2rm0uw  * https://twitter.com/plinz/status/1641256720864530432?s=61&t=ryK3X96D_TkGJtvu2rm0uw  * https://twitter.com/perrymetzger/status/1641280544007675904?s=61&t=ryK3X96D_TkGJtvu2rm0uw  * https://twitter.com/post_alchemist/status/1641274166966996992?s=61&t=ryK3X96D_TkGJtvu2rm0uw  * https://twitter.com/keerthanpg/status/1641268756071718913?s=61&t=ryK3X96D_TkGJtvu2rm0uw * https://twitter.com/levi7hart/status/1641261194903445504?s=61&t=ryK3X96D_TkGJtvu2rm0uw * https://twitter.com/luke_metro/status/1641232090036600832?s=61&t=ryK3X96D_TkGJtvu2rm0uw * https://twitter.com/gfodor/status/1641236230611562496?s=61&t=ryK3X96D_TkGJtvu2rm0uw * https://twitter.com/luke_metro/status/1641263301169680386?s=61&t=ryK3X96D_TkGJtvu2rm0uw * https://twitter.com/perrymetzger/status/1641259371568005120?s=61&t=ryK3X96D_TkGJtvu2rm0uw * https://twitter.com/elaifresh/status/1641252322230808577?s=61&t=ryK3X96D_TkGJtvu2rm0uw  * https://twitter.com/markovmagnifico/status/1641249417088098304?s=61&t=ryK3X96D_TkGJtvu2rm0uw * https://twitter.com/interpretantion/status/1641274843692691463?s=61&t=ryK3X96D_TkGJtvu2rm0uw * https://twitter.com/lan_dao_/status/1641248437139300352?s=61&t=ryK3X96D_TkGJtvu2rm0uw * https://twitter.com/lan_dao_/status/1641249458053861377?s=61&t=ryK3X96D_TkGJtvu2rm0uw  * https://twitter.com/growing_daniel/status/1641246902363766784?s=61&t=ryK3X96D_TkGJtvu2rm0uw  * https://twitter.com/alexandrosm/status/1641259179955601408?s=61&t=ryK3X96D_TkGJtvu2rm0uw 
2Viliam
What is the base rate for Twitter reactions for an international law proposal?
1jacquesthibs
Of course it’s often all over the place. I only shared the links because I wanted to make sure people weren’t deluding themselves with only positive comments.
2Viliam
This reminds me of the internet-libertarian chain of reasoning that anything that government does is protected by the threat of escalating violence, therefore any proposals that involve government (even mild ones, such as "once in a year, the President should say 'hello' to the citizens") are calls for murder, because... (create a chain of escalating events starting with someone non-violently trying to disrupt this, ending with that person being killed by cops)... Yes, a moratorium on AIs is a call for violence, but only in the sense that every law is a call for violence.

Quote from Cal Newport's Slow Productivity book: "Progress in theoretical computer science research is often a game of mental chicken, where the person who is able to hold out longer through the mental discomfort of working through a proof element in their mind will end up with the sharper result."

1keltan
Big fan of Cal’s work. He’s certainly someone who is pushing the front lines in the fight against akrasia. I’m currently reading “How to Win at College”. It’s a super information-dense package. Feels a bit like Rationality: A-Z, if it were specifically for college students trying to succeed. Why did you decide to share this quote? I feel like I’m missing some key context that could aid my understanding.

Do we expect future model architectures to be biased toward out-of-context reasoning (reasoning internally rather than in a chain-of-thought)? As in, what kinds of capabilities would lead companies to build models that reason less and less in token-space?

I mean, the first obvious thing would be that you are training the model to internalize some of the reasoning rather than having to pay for the additional tokens each time you want to do complex reasoning.

The thing is, I expect we'll eventually move away from just relying on transformers with scale. And so... (read more)

3Seth Herd
This is an excellent point. While LLMs seem (relatively) safe, we may very well blow right on by them soon. I do think that many of the safety advantages of LLMs come from their understanding of human intentions (and therefore implied values). Those would be retained in improved architectures that still predict human language use. If such a system's thought process was entirely opaque, we could no longer perform Externalized reasoning oversight by "reading its thoughts". But I think it might be possible to build a reliable agent from unreliable parts. I think humans are such an agent, and evolution made us this way because it's a way to squeeze extra capability out of a set of base cognitive capacities. Imagine an agentic set of scaffolding that merely calls the super-LLM for individual cognitive acts. Such an agent would use a hand-coded "System 2" thinking approach to solve problems, like humans do. That involves breaking a problem into cognitive steps. We also use System 2 for our biggest ethical decisions; we predict consequences of our major decisions, and compare them to our goals, including ethical goals. Such a synthetic agent would use System 2 for problem-solving capabilities, and also for checking plans for how well they achieve goals. This would be done for efficiency; spending a lot of compute or external resources on a bad plan would be quite costly. Having implemented it for efficiency, you might as well use it for safety. This is just restating stuff I've said elsewhere, but I'm trying to refine the model, and work through how well it might work if you couldn't apply any external reasoning oversight, and little to no interpretability. It's definitely bad for the odds of success, but not necessarily crippling. I think. This needs more thought. I'm working on a post on System 2 alignment, as sketched out briefly (and probably incomprehensibly) above.
4ryan_greenblatt
Did you mean something different than "AIs understand our intentions" (e.g. maybe you meant that humans can understand the AI's intentions?). I think future more powerful AIs will surely be strictly better at understanding what humans intend.
2Seth Herd
I think future more powerful/useful AIs will understand our intentions better IF they are trained to predict language. Text corpuses contain rich semantics about human intentions. I can imagine other AI systems that are trained differently, and I would be more worried about those. That's what I meant by current AI understanding our intentions possibly better than future AI.

I'm currently working on building an AI research assistant designed specifically for alignment research. I'm at the point where I will be starting to build specific features for the project and delegate work to developers who would like to contribute.

  • Developers: If you are a developer who might be interested in contributing to this project, send me a DM for more details.
  • Alignment Researchers: I have a long list of features I want to build. I need to prioritize the features that people actually think would help them the most. If you'd like to look over the features and provide feedback, send me a DM and I will send you the relevant list of features.
2jacquesthibs
Alignment Math people: I would appreciate it if someone could review this video of Terence Tao giving a presentation on machine-assisted proofs to give feedback on what they think an ideal alignment assistant could do in this domain. In addition, I'm thinking of eventually looking at models like DeepSeek-Prover to see if they can be beneficial for assisting alignment researchers in creating proofs:
6mesaoptimizer
I've experimented with Claude Opus for simple Ada autoformalization test cases (specifically quicksort), and it seems like the sort of issues that make LLM agents infeasible (hallucination-based drift, subtle drift caused by sticking to certain implicit assumptions you made before) are also the issues that make Opus hard to use for autoformalization attempts. I haven't experimented with a scaffolded LLM agent for autoformalization, but I expect it won't go very well either, primarily because scaffolding involves attempts to make human-like implicit high-level cognitive strategies into explicit algorithms or heuristics such as tree of thought prompting, and I expect that this doesn't scale given the complexity of the domain (sufficiently general autoformalizing AI systems can be modelled as effectively consequentialist, which makes them dangerous). I don't expect a scaffolded (over Opus) LLM agent to succeed at autoformalizing quicksort right now either, mostly because I believe RLHF tuning has systematically optimized Opus to write the bottom line first and then attempt to build or hallucinate a viable answer, and then post-hoc justify the answer. (While steganographic non-visible chain-of-thought may have gone into figuring out the bottom line, it still is worse than first doing visible chain-of-thought such that it has more token-compute-iterations to compute its answer.) If anyone reading this is able to build a scaffolded agent that autoformalizes (using Lean or Ada) algorithms of complexity equivalent to quicksort reliably (such that more than 5 out of 10 of its attempts succeed) within the next month of me writing this comment, then I'd like to pay you 1000 EUR to see your code and for an hour of your time to talk with you about this. That's a little less than twice my current usual monthly expenses, for context.
2jacquesthibs
Great. Yeah, I also expect that it is hard to get current models to work well on this. However, I will mention that the DeepSeekMath model does seem to outperform GPT-4 despite having only 7B parameters. So, it may be possible to create a +70B fine-tune that basically destroys GPT-4 at math. The issue is whether it generalizes to the kind of math we'd commonly see in alignment research. Additionally, I expect at least a bit can be done with scaffolding, search, etc. I think the issue with many prompting methods atm is that they are specifically trying to get the model to arrive at solutions on their own. And what I mean by that is that they are starting from the frame of "how can we get LLMs to solve x math task on their own," instead of "how do we augment the researcher's ability to arrive at (better) proofs more efficiently using LLMs." So, I think there's room for product building that does not involve "can you solve this math question from scratch," though I see the value in getting that to work as well.

(This is the tale of a potentially reasonable CEO of the leading AGI company, not the one we have in the real world. Written after a conversation with @jdp.)

You’re the CEO of the leading AGI company. You start to think that your moat is not as big as it once was. You need more compute and need to start accelerating to give yourself a bigger lead, otherwise this will be bad for business.

You start to look around for compute, and realize you have 20% of your compute you handed off to the superalignment team (and even made a public commitment!). You end up ma... (read more)

So, you go to government and lobby. Except you never intended to help the government get involved in some kind of slow-down or pause. Your intent was to use this entire story as a mirage for getting rid of those who didn’t align with you and lobby the government in such a way that they don’t think it is such a big deal that your safety researchers are resigning.

You were never the reasonable CEO, and now you have complete power.

From a Paul Christiano talk called "How Misalignment Could Lead to Takeover" (from February 2023):

Assume we're in a world where AI systems are broadly deployed, and the world has become increasingly complex, where humans know less and less about how things work.

A viable strategy for AI takeover is to wait until there is certainty of success. If a 'bad AI' is smart, it will realize it won't be successful if it tries to take over, not a problem. 

So you lose when a takeover becomes possible, and some threshold of AIs behave badly. If all the smartest AIs... (read more)

This seems like a fairly important paper by DeepMind regarding generalization (and lack of it in current transformer models): https://arxiv.org/abs/2311.00871

Here’s an excerpt on transformers potentially not really being able to generalize beyond training data:

Our contributions are as follows:

  • We pretrain transformer models for in-context learning using a mixture of multiple distinct function classes and characterize the model selection behavior exhibited.
  • We study the in-context learning behavior of the pretrained transformer model on functions th
... (read more)
6leogao
i predict this kind of view of non magicalness of (2023 era) LMs will become more and more accepted, and this has implications on what kinds of alignment experiments are actually valuable (see my comment on the reversal curse paper). not an argument for long (50 year+) timelines, but is an argument for medium (10 year) timelines rather than 5 year timelines
5leogao
also this quote from the abstract is great: i used to call this something like "tackling the OOD generalization problem by simply making the distribution so wide that it encompasses anything you might want to use it on"
4Noosphere89
My major takeaways, assuming this research scales (it was only done on GPT-2, and we already knew it couldn't generalize), are: 1. Gary Marcus was right about LLMs mostly not reasoning outside the training distribution, and this updates me more towards "LLMs probably aren't going to be godlike, or be nearly as impactful as LW says they will be." 2. Be more skeptical of AI progress leading to big things, and in general unless reality can simply be memorized, scaling probably won't work to automate the economy. More generally, this updates me towards longer timelines, and longer tails on those timelines. 3. Be slightly more pessimistic on AI safety, since LLMs have a bunch of nice properties, and future AI probably will have less nice properties, though alignment optimism mostly doesn't depend on LLMs. 4. AI governance gets a lucky break, since they only have to regulate misuse, and even though their threat model isn't likely or even probable to be realized, it's still nice that we don't have to deal with the disruptive effects of AI now.
3jacquesthibs
I am sharing this since I think it will change your view on how much to update on this paper (I should have shared this initially). Here's what the paper author said on X:     So, with that, I said: To which @Jozdien replied:
2Noosphere89
In retrospect, I probably should have updated much less than I did. I thought that it was actually testing a real LLM, which makes me less confident in the paper. Should have responded long ago, but responding now.
2jacquesthibs
Title: Is the alignment community over-updating on how scale impacts generalization? So, apparently, there's a rebuttal to the recent Google generalization paper (and also worth pointing out it wasn't done with language models, just sinusoidal functions, not language): But then, the paper author responds: ---------------------------------------- This line of research makes me question one thing: "Is the alignment community over-updating on how scale impacts generalization?" It remains to be seen how well models will generalize outside of their training distribution (interpolation vs extrapolation). In other words, when people say that GPT-4 (and other LLMs) can generalize, I think they need to be more careful about what they really mean. Is it doing interpolation or extrapolation? Meaning, yes, GPT-4 can do things like write a completely new poem, but poems and related stuff were in its training distribution! So, you can say it is generalizing, but I think it's a much weaker form of generalization than what people really imply when they say generalization. A stronger form of generalization would be an AI that can do completely new tasks that are actually outside of its training distribution. Now, at this point, you might say, "yes, but we know that LLMs learn functions and algorithms to do tasks, and as you scale up and compress more and more data, you will uncover more meta-algorithms that can do this kind of extrapolation/tasks outside of the training distribution." Well, two things: 1. It remains to be seen when or if this will happen in the current paradigm (no matter how much you scale up). 2. It's not clear to me how well things like induction heads continue to work on things that are outside of their training distribution. If they don't adapt well, then it may be the same thing for other algorithms. What this would mean in practice, I'm not sure. I've been looking at relevant papers, but haven't found an answer yet. This brings me to another point
5jacquesthibs
Or perhaps as @Nora Belrose mentioned to me: "Perhaps we should queer the interpolation-extrapolation distinction."
2Garrett Baker
Some evidence this is not so fundamental, and we should expect a (or many) phase transition(s) to more generalizing in-context learning as we increase the log number of tasks.
1Oliver Sourbut
My hot take is that this paper's prominence is a consequence of importance hacking (I'm not accusing the authors in particular). Zero or near-zero relevance to LLMs. Authors get a yellow card for abusing the word 'model' twice in the title alone.
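To make the interpolation vs extrapolation distinction from this thread concrete, here is a toy sketch. It is not the paper's setup and says nothing about transformers: it assumes a nearest-neighbor "model" that memorizes samples of a sine wave, and the ranges and function names are made up for illustration. The model does well between training points and badly outside the training range.

```python
import math

def nearest_neighbor_predict(train_x, train_y, x):
    """Predict y at x by copying the label of the closest training input."""
    idx = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[idx]

def mean_abs_error(train_x, train_y, test_x):
    """Average |prediction - sin(x)| over the test inputs."""
    errors = [
        abs(nearest_neighbor_predict(train_x, train_y, x) - math.sin(x))
        for x in test_x
    ]
    return sum(errors) / len(errors)

# Training distribution (an assumption for this toy): 200 evenly spaced
# points on [0, 2*pi], labeled with sin(x).
train_x = [2 * math.pi * i / 199 for i in range(200)]
train_y = [math.sin(x) for x in train_x]

# Interpolation: test points halfway between neighboring training points.
inside = [2 * math.pi * (i + 0.5) / 199 for i in range(199)]
# Extrapolation: test points on [3*pi, 5*pi], entirely outside the training range.
outside = [3 * math.pi + 2 * math.pi * i / 199 for i in range(200)]

interp_error = mean_abs_error(train_x, train_y, inside)
extrap_error = mean_abs_error(train_x, train_y, outside)
print(f"interpolation error: {interp_error:.4f}")
print(f"extrapolation error: {extrap_error:.4f}")
```

Outside the training range, every query maps to the same edge training point, so the error is roughly the average magnitude of the sine wave itself. Whether large language models behave more like this memorizer or learn something that extends beyond their data is exactly the open question in the thread.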

Given funding is a problem in AI x-risk at the moment, I’d love to see people start thinking of creative ways to provide additional funding to alignment researchers who are struggling to get funding.

For example, I’m curious if governance orgs would pay for technical alignment expertise as a sort of consultant service.

Also, it might be valuable to have full-time field-builders that are solely focused on getting more high-net-worth individuals to donate to AI x-risk.

On joking about how "we're all going to die"

Setting aside the question of whether people are overly confident about their claims regarding AI risk, I'd like to talk about how we talk about it amongst ourselves.

We should avoid jokingly saying "we're all going to die" because I think it will corrode your calibration to risk with respect to P(doom) and it will give others the impression that we are all more confident about P(doom) than we really are.

I think saying it jokingly still ends up creeping into your rational estimates on timelines and P(doom). I expe... (read more)

What are some important tasks you've found too cognitively taxing to get in the flow of doing?

One thing that I'd like to consider for Accelerating Alignment is to build tools that make it easier to get in the habit of cognitively demanding tasks by reducing the cognitive load necessary to do the task. This is part of the reason why I think people are getting such big productivity gains from tools like Copilot.

One way I try to think about it is like getting into the habit of playing guitar. I typically tell people to buy an electric guitar rather than an ac... (read more)

5plex
For developing my hail mary alignment approach, the dream would be to be able to load enough of the context of the idea into a LLM that it could babble suggestions (since the whole doc won't fit in the context window, maybe randomizing which parts beyond the intro are included for diversity?), then have it self-critique those suggestions automatically in different threads in bulk and surface the most promising implementations of the idea to me for review. In the perfect case I'd be able to converse with the model about the ideas and have that be not totally useless, and pump good chains of thought back into the fine-tuning set.
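The workflow described in the comment above (keep the intro, randomize which other parts of the doc go into the context, generate suggestions, self-critique them in bulk, surface the best) could be sketched roughly as follows. This is only an illustration under stated assumptions: `generate` and `critique` are hypothetical callables standing in for real LLM calls, the context budget is counted in characters rather than tokens, and none of the names come from an existing tool.

```python
import random
from typing import Callable, List, Tuple

def build_context(intro: str, sections: List[str], budget_chars: int,
                  rng: random.Random) -> str:
    """Always keep the intro, then add randomly ordered sections that still fit.

    The budget counts intro and section characters only, ignoring separators.
    """
    chosen = [intro]
    used = len(intro)
    for section in rng.sample(sections, k=len(sections)):
        if used + len(section) <= budget_chars:
            chosen.append(section)
            used += len(section)
    return "\n\n".join(chosen)

def surface_best(intro: str, sections: List[str],
                 generate: Callable[[str], str],
                 critique: Callable[[str], float],
                 n_samples: int = 8, top_k: int = 3,
                 budget_chars: int = 2000, seed: int = 0) -> List[Tuple[float, str]]:
    """Sample contexts, generate one suggestion per context, score, keep the top_k."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n_samples):
        context = build_context(intro, sections, budget_chars, rng)
        suggestion = generate(context)  # hypothetical LLM call
        scored.append((critique(suggestion), suggestion))  # hypothetical self-critique score
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Stand-in functions so the sketch runs without any API access.
def fake_generate(context: str) -> str:
    return f"suggestion based on {len(context)} chars of context"

def fake_critique(suggestion: str) -> float:
    return float(len(suggestion))

if __name__ == "__main__":
    doc_sections = ["a" * 50, "b" * 60, "c" * 70]
    for score, text in surface_best("INTRO", doc_sections, fake_generate,
                                    fake_critique, n_samples=5, top_k=2,
                                    budget_chars=130):
        print(score, text)
```

In a real version, the critique step would probably run in separate threads or conversations, as the comment suggests, and the top suggestions would go to a human for review rather than being trusted directly.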

Projects I'd like to work on in 2023.

Wrote up a short (incomplete) bullet point list of the projects I'd like to work on in 2023:

  • Accelerating Alignment
    • Main time spent (initial ideas, will likely pivot to varying degrees depending on feedback; will start with one):
      • Fine-tune GPT-3/GPT-4 on alignment text and connect the API to LoomVSCode (CoPilot for alignment research) and potentially notetaking apps like Roam Research. (1-3 months, depending on bugs and if we continue to add additional features.)
      • Create an audio-to-post pipeline where we can eas
... (read more)
7jacquesthibs
Two other projects I would find interesting to work on: * Causal Scrubbing to remove specific capabilities from a model. For example, training a language model on The Pile and a code dataset. Then, applying causal scrubbing to try and remove the model's ability to generate code while still achieving a similar loss on The Pile. * A few people have started extending the work from the Discovering Latent Knowledge in Language Models without Supervision paper. I think this work could potentially evolve into a median-case solution to avoiding x-risk from AI.
3chanamessinger
Curious if you have any updates!
2jacquesthibs
Working on a new grant proposal right now. Should be sent this weekend. If you’d like to give feedback or have a look, please send me a DM! Otherwise, I can send the grant proposal to whoever wants to have a look once it is done (still debating about posting it on LW). Outside of that, there has been a lot of progress on the Cyborgism discord (there is a VSCode plugin called Worldspider that connects to the various APIs, and there has been more progress on Loom). Most of my focus has gone towards looking at the big picture and keeping an eye on all the developments. Now, I have a better vision of what is needed to create an actually great alignment assistant and have talked to other alignment researchers about it to get feedback and brainstorm. However, I’m spread way too thin and will request additional funding to get some engineer/builder to start building the ideas out so that I can focus on the bigger picture and my alignment work. If I can get my funding again (previous funding ended last week) then my main focus will be building out the system I have in mind for accelerating alignment work + continue working on the new agenda I put out with Quintin and others. There’s some other stuff I’d like to do, but those are lower priority or will depend on timing. It’s been hard to get the funding application done because things are moving so fast and I’m trying not to build things that will be built by default. And I’ve been talking to some people about the possibility of building an org so that this work could go a lot faster.
3plex
Very excited by this agenda, was discussing my hope that someone finetunes LLMs on the alignment archive soon today!
2Mati_Roy
do you have a link? I'd be interested in being added to the Discord

OpenAI CEO Sam Altman has privately said the company could become a benefit corporation akin to rivals Anthropic and xAI.

Paywalled: https://www.theinformation.com/articles/openai-ceo-says-company-could-become-benefit-corporation-akin-to-rivals-anthropic-xai

"Sam Altman recently told some shareholders that OAI is considering changing its governance structure to a for-profit business that OAI's nonprofit board doesn't control. [...] could open the door to public offering of OAI; may give Altman an opportunity to take a stake in OAI."

Viliam129

Perhaps I am too cynical, but it seems to me that Sam Altman will say anything... and change his mind later.

2ChristianKl
It's still interesting that he calculated that it is advantageous to say it.
4ozziegooen
Quick point - a "benefit corporation" seems almost identical to a "corporation" to me, from what I understand. I think many people assume it's a much bigger deal than it actually is. My impression is that practically speaking, this just gives the execs more power to do whatever they feel they can sort of justify, without shareholders being able to have the legal claims to stop them. I'm not sure if this is a good thing in the case of OpenAI. (Would we prefer Sam A / the board have more power, or that the shareholders have more power?)  I think B-Corps make it harder for them to get sued for not optimizing for shareholders. Hypothetically, it makes it easier for them to be sued for not optimizing their other goals, but I'm not sure if this ever/frequently actually happens. 
2ChristianKl
In the case of OpenAI it also means that Sam doesn't hold any stock in OpenAI and thus has different incentives than he would if he owned a decent amount of stock.
2ChristianKl
It would seem strange to me if that's legally possible, but maybe it is. 

Jacques' AI Tidbits from the Web

I often find information about AI development on X (f.k.a. Twitter) and sometimes other websites. They usually don't warrant their own post, so I'll use this thread to share. I'll be placing a fairly low filter on what I share.

There's someone on X (f.k.a. Twitter) called Jimmy Apples (🍎/acc) and he has shared some information in the past that turned out to be true (apparently the GPT-4 release date and that OAI's new model would be named "Gobi"). He recently tweeted, "AGI has been achieved internally." Some people think that the Reddit comment below may be from the same guy (this is just a weak signal, I’m not implying you should consider it true or update on it):

7elifland
Where is the evidence that he called OpenAI’s release date and the Gobi name? All I see is a tweet claiming the latter but it seems the original tweet isn’t even up?
2jacquesthibs
This is the tweet for Gobi: https://x.com/apples_jimmy/status/1703871137137176820?s=46&t=YyfxSdhuFYbTafD4D1cE9A I asked someone if it’s fake. Apparently not, you can find it on google archive: https://threadreaderapp.com/thread/1651837957618409472.html
3Person
Predicting the GPT-4 launch date can easily be disproven with the confidence game. It's possible he just created a prediction for every day and deleted the ones that didn't turn out right. For the Gobi prediction it's tricky. The only evidence is the Threadreader and a random screenshot from a guy who seems clearly related to jim. I am very suspicious of the Threadreader one. On one hand I don't see a way it can be faked, but it's very suspicious that the Gobi prediction is Jimmy's only post that was saved there despite him making an even bigger bombshell "prediction". It's also possible, though unlikely, that the Information's article somehow found his tweet and used it as a source for their article. What kills Jimmy's credibility for me is his prediction back in January (you can use the Wayback Machine to find it) that OAI had finished training GPT-5, no not a GPT-5 level system, the ACTUAL GPT-5 in October 2022 and that it was 125T parameters. Also goes without saying, pruning his entire account is suspicious too. 
2jacquesthibs
I’ll try to find them, but this was what people were saying. They also said he deleted past tweets so that evidence may forever be gone. I remember one tweet where Jimmy said something like, “Gobi? That’s old news, I said that months ago, you need to move on to the new thing.” And I think he linked the tweet though I’m very unsure atm. Need to look it up, but you can use the above for a search.
2jacquesthibs
New tweet by Jimmy Apples. This time, he's insinuating that OpenAI is funding a stealth startup working on BCI. If this is true, then it makes sense they would prefer not to do it internally to avoid people knowing in advance based on their hires. A stealth startup would keep things more secret. Might be of interest, @lisathiergart and @Allison Duettmann.
2jacquesthibs
Not sure exactly what this means, but Jimmy Apples has now tweeted the following: My gut is telling me that he apple-bossed too close to the sun (released info he shouldn't have, and now he's concerned about his job or some insider's job), and it's time for him to stop sharing stuff (the apple being bitten symbolizing that he is done sharing info). This is because the information in my shortform was widely shared on X and beyond. He also deleted all of his tweets (except for the retweets).
1Person
Or that he was genuinely just making things up and tricking us for fun, and a cryptic exit is a perfect way to leave the scene. I really think people are looking way too deep into him and ignoring the more outlandish predictions he's made (125T GPT-4 and 5 in October 2022), along with the fact there is never actual evidence of his accurate ones, only 2nd hand very specific and selective archives.
2jacquesthibs
He did say some true things before. I think it's possible all of the new stuff is untrue, but we're getting more reasons to believe it's not entirely false. The best liars sprinkle in truth. I think, as a security measure, it's also possible that not even people within OpenAI know all the big details of what's going on (this is apparently the case for Anthropic). This could mean, for OpenAI employees, that some details are known while others are not. Employees themselves could be forced to speculate on some things. Either way, I'm not obsessing too much over this. Just sharing what I'm seeing.
3jacquesthibs
More predictions/insights from Jimmy and crew. He's implying (as I have also been saying) that some people are far too focused on scale over training data and architectural improvements. IMO, the bitter lesson is a thing, but I think we've over-updated on it. Relatedly, someone shared a new 13B model that is apparently comparable to or better than GPT-4 in logical reasoning (based on benchmarks, which I don't usually trust too much). Note that the model is a solver-augmented LM. Here's some context regarding the paper:
2jacquesthibs
Sam Altman at a YC founder reunion: https://x.com/smahsramo/status/1706006820467396699?s=46&t=YyfxSdhuFYbTafD4D1cE9A “Most interesting part of @sama talk: GPT5 and GPT6 are “in the bag” but that’s likely NOT AGI (eg something that can solve quantum gravity) without some breakthroughs in reasoning. Strong agree.”
2Mitchell_Porter
AGI is "something that can solve quantum gravity"?  That's not just a criterion for general intelligence, that's a criterion for genius-level intelligence. And since general intelligence in a computer has advantages of speed, copyability, little need for down time, that are not possessed by human general intelligence, AI will be capable of contributing to its training, re-design, agentization, etc, long before "genius level" is reached.  This underlines something I've been saying for a while, which is that superintelligence, defined as AI that definitively surpasses human understanding and human control, could come into being at any time (from large models that are not publicly available but which are being developed privately by Big AI companies). Meanwhile, Eric Schmidt (former Google CEO) says about five years until AI is actively improving itself, and that seems generous.  So I'll say: timeline to superintelligence is 0-5 years.
2Vladimir_Nesov
In some models of the world this is seen as unlikely to ever happen, these things are expected to coincide, which collapses the two definitions of AGI. I think the disparity between sample efficiency of in-context learning and that of pre-training is one illustration for how these capabilities might come apart, in the direction that's opposite to what you point to: even genius in-context learning doesn't necessarily enable the staying power of agency, if this transient understanding can't be stockpiled and the achieved level of genius is insufficient to resolve the issue while remaining within its limitations (being unable to learn a lot of novel things in the course of a project).
2jacquesthibs
Someone in the open source community tweeted: "We're about to change the AI game. I'm dead serious." My guess is that he is implying that they will be releasing open source mixture of experts models in a few months from now. They are currently running them on CPUs.
2jacquesthibs
Lots of cryptic tweets from the open source LLM guys: https://x.com/abacaj/status/1705781881004847267?s=46&t=YyfxSdhuFYbTafD4D1cE9A “If you thought current open source LLMs are impressive… just remember they haven’t peaked yet” To be honest, my feeling is that they are overhyping how big of a deal this will be. Their ego and self-importance tend to be on full display.
3Person
Occasionally reading what OSS AI gurus say, they definitely overhype their stuff constantly. The ones who make big claims and try to hype people up are often venture entrepreneur guys rather than actual ML engineers. 
2jacquesthibs
The open source folks I mostly keep an eye on are the ones who do actually code and train their own models. Some are entrepreneurs, but they know a decent amount. Not top engineers, but they seem to be able to curate datasets and train custom models. There’s some wannabe script kiddies too, but once you lurk enough, you become aware of who are actually decent engineers (you’ll find some at Vector Institute and Jeremy Howard is pro- open source, for example). I wouldn’t totally discount them having an impact, even though some of them will overhype.

I think it would be great if alignment researchers read more papers

But really, you don't even need to read the entire paper. Here's a reminder to consciously force yourself to at least read the abstract. Sometimes I catch myself running away from reading an abstract of a paper even though it is very little text. Over time I've just been forcing myself to at least read the abstract. A lot of times you can get most of the update you need just by reading the abstract. Try your best to make it automatic to do the same.

To read more papers, consider using Semant... (read more)

On hyper-obsession with one goal in mind

I’ve always been interested in people just becoming hyper-obsessed in pursuing a goal. One easy example is with respect to athletes. Someone like Kobe Bryant was just obsessed with becoming the best he could be. I’m interested in learning what we can from the experiences of the hyper-obsessed and what we can apply to our work in EA / Alignment.

I bought a few books on the topic, I should try to find the time to read them. I’ll try to store some lessons in this shortform, but here’s a quote from Mr. Beast’s Joe Rogan in... (read more)

I'm exploring the possibility of building an alignment research organization focused on augmenting alignment researchers and progressively automating alignment research (yes, I have thought deeply about differential progress and other concerns). I intend to seek funding in the next few months, and I'd like to chat with people interested in this kind of work, especially great research engineers and full-stack engineers who might want to cofound such an organization. If you or anyone you know might want to chat, let me know! Send me a DM, and I can send you ... (read more)

2Nathan Helm-Burger
I'm intrigued. Let me know if you'd like to chat

I had this thought yesterday: "If someone believes in the 'AGI lab nationalization by default' story, then what would it look like to build an organization or startup in preparation for this scenario?"

For example, you try to develop projects that would work exceptionally well in a 'nationalization by default' world while not getting as much payoff if you are in a non-nationalization world. The goal here is to do the normal startup thing: risky bets with a potentially huge upside.

I don't necessarily support nationalization and am still trying to think throu... (read more)

2Viliam
If I expected that an AI above a certain line will be nationalized, I would try to get just below that line and stay there for as long as possible, to maximize my profits. Alternatively, I may choose to cross the line at the moment my competitors get close to it, if I want to be remembered by history as the one who reached it first. If I expected that an AI above a certain line will be nationalized, but I believed that the government will give a lot of money in return (for example, to convince other entrepreneurs that the country is not turning into a communist dystopia), I might decide to try to get that money as soon as possible (less work, guaranteed profit, a place in history), so I would actually exaggerate how dangerous my AI is, to make the government take it away sooner. But if my goal is to avoid nationalization... One option is to make my work distributed across countries, so whenever one of them starts talking about nationalizing it, I will make it clear that they can only take a small part of it, and I will simply continue developing it in the remaining countries. I would move the key members of my company to countries with the lowest probability of nationalization. Actually, they could live in a country where I have no servers, and they would connect to them remotely. So that when the government takes the servers, it cannot compel the people to explain how the entire thing works, or prevent them from remotely destroying the part that got nationalized. Also, every part would have a backup in another country. Another option is to make the AI pretend that it is less smart than it is, to officially stay below the line. It would mean I cannot directly sell its abilities to customers, but maybe I could use it myself, e.g. to let it manage my finances, or I could use it in a plausibly deniable way, e.g. you can hire my smaller company that hires 100 experts who also use my AI, and everything they do is officially attributed to the genius of the human e
4jacquesthibs
We had a similar thought: But yeah, my initial comment was about how to take advantage of nationalization if it does happen in the way Leopold described/implied.
2jacquesthibs
If anyone would like to discuss this privately, please message me. I'm considering whether to build a startup that tackles the kinds of things I describe above (e.g., monitoring), so I would love to get feedback.
1davekasten
If you think nationalization is near and the default, you shouldn't try to build projects and hope they get scooped into the nationalized thing.  You should try to directly influence the policy apparatus through writing, speaking on podcasts, and getting to know officials in the agencies most likely to be in charge of that. (Note: not a huge fan of nationalization myself due to red-queen's-race concerns)
2jacquesthibs
You can do the writing, but if you have a useful product and connect with those who are within the agencies, you are in a position where you have built a team and infrastructure for several years with the purpose of getting pulled into the nationalization project. You likely get most of the value by just keeping close ties with others within government while also having built a ready-to-use solution that can prevent the government from rushing out a worse version of what you’ve built. I think it’s important to see AI Safety as a collective effort rather than one person’s decision (of working inside or out of government).
1davekasten
I think I am very doubtful of the ability of outsiders to correctly predict -- especially outsiders new to government contracting -- what the government might pull in.  I'd love to be wrong, though!  Someone should try it, and I think I was probably too definitive in my comment above.
2jacquesthibs
Yes, but this is similar to usual startups: it’s a calculated bet you are making. So you expect some of the people who try this will fail, but investors hope one of them will be a unicorn.
1gw
This might look like building influence / a career in the federal orgs that would be involved in nationalization, rather than a startup. Seems like positioning yourself to be in charge of nationalized projects would be the highest impact?
2jacquesthibs
I agree that this would be impactful! I'm mostly thinking about a more holistic approach that assumes you'd have reasonable access to 'the right people' in those government positions. Similar to the current status quo where you have governance people and technical people filling in the different gaps.

Anybody know how Fathom Radiant (https://fathomradiant.co/) is doing?

They’ve been working on photonics compute for a long time so I’m curious if people have any knowledge on the timelines they expect it to have practical effects on compute.

Also, Sam Altman and Scott Gray at OpenAI are both investors in Fathom. Not sure when they invested.

I’m guessing it’s still a long-term bet at this point.

OpenAI also hired someone who worked at PsiQuantum recently. My guess is that they are hedging their bets on the compute end and generally looking for opportunities on ... (read more)

4jacquesthibs
I'm working on publishing a post on this and energy bottlenecks. If anyone is interested in doing a quick skim for feedback, I hope to publish it in under two hours.
2jacquesthibs
I took this post down, given that some people have been downvoting it heavily. Writing my thoughts here as a retrospective: I think one reason it got downvoted is that I used Claude as part of the writing process and it was too disjointed/obvious (because I wanted to rush the post out), but I didn't think it was that bad and I did try to point out that it was speculative in the parts that mattered. One comment specifically pointed out that it felt like a lot was written by an LLM, but I didn't think I relied on Claude that much and I rewrote the parts that included LLM writing. I also don't feel as strongly about using this as a reason to dislike a piece of writing, though I understand the current issue of LLM slop. However, I wonder if some people downvoted it because they see it as infohazardous. My goal was to try to determine if photonic computing would become a big factor at some point (which might be relevant from a forecasting and governance perspective) and put out something quick for discussion rather than spending much longer researching and re-writing. I agreed with what I shared. But I may need to adjust my expectations as to what people prefer as things worth sharing on LessWrong.

I shared the following as a bio for EAG Bay Area 2024. I'm sharing this here if it reaches someone who wants to chat or collaborate.

Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.

CURRENT WORK

  • Collaborating with Quintin Pope on our Supervising AIs Improving AIs agenda (making automated AI science safe and controllable). The current project involves a new method allowi
... (read more)

I think people might have the implicit idea that LLM companies will continue to give API access as the models become more powerful, but I was talking to someone earlier this week who reminded me that this is not necessarily the case. If you gain powerful enough models, you may just keep them to yourself and spin up AI companies with AI employees to make a ton of cash instead of just charging for tokens.

For this reason, even if outside people build the proper brain-like AGI setup with additional components to squeeze out capabilities from LLMs, they may be limited by:

1. open-source models

2. the API of the weaker models from the top companies

3. the best API of the companies that are lagging behind

A frame for thinking about takeoff

One error people can make when thinking about takeoff speeds is assuming that because we are in a world with some gradual takeoff, it now means we are in a "slow takeoff" world. I think this can lead us to make some mistakes in our strategy. I usually prefer thinking in the following frame: “is there any point in the future where we’ll have a step function that prevents us from doing slow takeoff-like interventions for preventing x-risk?”

In other words, we should be careful not to assume that some "slow takeoff" doesn't have a... (read more)

Imagine there was an AI-suggestion tool that could predict reasons why you agree/disagree-voted on a comment, and you just had to click one of the generated answers to provide a bit of clarity at a low cost.

Easy LessWrong post to LLM chat pipeline (browser side-panel)

I started using Sider as @JaimeRV recommended here. Posting this as a top-level shortform since I think other LessWrong users should be aware of it.

Website with app and subscription option. Chrome extension here.

You can either pay for the monthly service and click the "summarize" feature on a post and get the side chat window started or put your OpenAI API / ChatGPT Pro account in the settings and just cmd+a the post (which automatically loads the content in the chat so you can immediately ask a ... (read more)

4the gears to ascension
I have a user script that lets me copy the post into the Claude ui. No need to pay another service.
4Ruby
Same question as above.
2Ruby
I'm curious how much you're using this and if it's turning out to be useful on LessWrong. Interested because we've been thinking about integrating LLM stuff like this into LW itself.
3JaimeRV
I have been using Sider for a few weeks and found it pretty helpful:
Setup:
  • I use gpt4o-mini, which is basically free and faster than doing anything in Claude or ChatGPT
  • Mostly for papers and LW/EAF articles
  • I have a shortcut that adds "https://r.jina.ai/" in front of the URL to convert the page to markdown, and then I just ctrl+A the entire page and chat
  • For privacy reasons I have only allowed the extension on https://r.jina.ai/* and https://www.youtube.com/*
  • I use similar prompts to Jacques. Some additional ones:
    -- Justify your previous answers, citing from the original text
    -- Challenge my knowledge (here I have a longer prompt where it asks me to do stuff like draw a mindmap, answer questions,...)
  • I also have it with (external) Whisper because I often think better out loud
Pros:
  • Fast
  • Basically free
  • Way easier to digest and interact with dry papers/articles
  • Customizable prompts for the conversation, which make the workflow faster because you only have to click
  • For YouTube, as a first filter
Cons:
  • gpt4o-mini (at least) hallucinates a bunch, so you often have to ask it to justify its answers
  • (As with all chatbots) you should take the responses with a grain of salt, be very specific with your questions, and reread the relevant original sections to double check.
Other:
  • IMO if you end up integrating something like this in LW I think it would be net positive. Especially if you can link it to @stampy or similar to ask clarification questions about concepts, ...
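The URL-prefix shortcut from the setup above can be sketched as a small helper. This is only an illustration: the function name `to_reader_url` and the constant name are hypothetical, and the only thing taken from the comment is that "https://r.jina.ai/" goes in front of the page URL.

```python
# Prefix taken from the comment above; everything else here is illustrative.
JINA_READER_PREFIX = "https://r.jina.ai/"


def to_reader_url(url: str) -> str:
    """Put the reader prefix in front of a page URL, unless it is already there."""
    if url.startswith(JINA_READER_PREFIX):
        return url
    return JINA_READER_PREFIX + url


if __name__ == "__main__":
    # Open the printed URL in the browser, then select all and paste into the chat.
    print(to_reader_url("https://www.lesswrong.com/"))
```

The same logic could presumably live in a bookmarklet or user script; Python is used here only to keep the sketch short.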
4jacquesthibs
I was thinking of linking it to an Alignment Research Assistant I've been working on, too.
2jacquesthibs
I just started using this extension, but basically, every time I'm about to read a long post, I feed it and all the comments to Claude chat. The question-flow is often: 1. What are the key points of the post? 2. (Sometimes) Explain x in more detail in relation to y or some specific clarification questions. 3. What are the key criticisms of this post based on the comments? 4. How does the author respond to those criticisms? 5. (Sometimes) Follow-up questions about the post.

I like this feature on the EA Forum so sharing here to communicate interest in having it added to LessWrong as well:

The EA Forum has an audio player interface added to the bottom of the page when you listen to a post. In addition, there are play buttons on the left side of every header to make it quick to continue playing from that section of the post.

6habryka
Yeah, we are very likely to port this over. We've just been busy with LessOnline and Manifest and so haven't gotten around to it.
2Seth Herd
Nice! The audio player allows me to take in more LessWrong, and any more ease-of-use features would be great.
2Eli Tyre
You can get browser extensions that do this for all webpages. I use Speechify.

Clarification on The Bitter Lesson and Data Efficiency

I thought this exchange provided some much-needed clarification on The Bitter Lesson that I think many people don't realize, so I figured I'd share it here:

LeCun responds:

Then, Richard Sutton agrees with Yann. Someone asks him:

4TsviBT
It's clear from Sutton's original article. https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce#The_bitter_lesson_and_the_success_of_scaling
2jacquesthibs
Yes, but despite the Bitter Lesson being quite short, many people have not read the original text and are just taking the 'scale is all you need' telephone game discussion of it.

There are those who have motivated reasoning and don’t know it.

Those who have motivated reasoning, know it, and don’t care.

Finally, those who have motivated reasoning, know it, but try to mask it by including tame (but not significant) takes the other side would approve of.

It seems that @Scott Alexander believes that there's a 50%+ chance we all die in the next 100 years if we don't get AGI (EDIT: how he places his probability mass on existential risk vs catastrophe/social collapse is now unclear to me). This seems like a wild claim to me, but here's what he said about it in his AI Pause debate post:

Second, if we never get AI, I expect the future to be short and grim. Most likely we kill ourselves with synthetic biology. If not, some combination of technological and economic stagnation, rising totalitarianism + illiberalism

... (read more)
4tslarm
I don't think that's what he claimed. He said (emphasis added): Which fits with his earlier sentence about various factors that will "impoverish the world and accelerate its decaying institutional quality". (On the other hand, he did say "I expect the future to be short and grim", not short or grim. So I'm not sure exactly what he was predicting. Perhaps decline -> complete vulnerability to whatever existential risk comes along next.)
3Vladimir_Nesov
It's "we end up dead or careening towards Venezuela" in the original, which is not the same thing. Venezuela has survivors. Existence of survivors is the crucial distinction between extinction and global catastrophe. AGI would be a much more reasonable issue if it was merely risking global catastrophe.
4jacquesthibs
In the first couple of sentences he says “if we never get AI, I expect the future to be short and grim. Most likely we kill ourselves with synthetic biology.” So it seems he’s putting most of his probability mass on everyone dying. But then afterwards he says: “But if we ban all gameboard-flipping technologies, then we do end up with bioweapon catastrophe or social collapse.” I think the people who are responding are seemingly only reading the Venezuela part and assuming most of the probability mass he’s putting in the 50% is just a ‘catastrophe’ like Venezuela. But then why would he say he expects the future to be short conditional on no AI?
2Vladimir_Nesov
It's a bit ambiguous, but "bioweapon catastrophe or social collapse" is not literal extinction, and I'm reading "I expect the future to be short and grim" as plausibly referring to destruction of uninterrupted global civilization, which might well recover after 3000 years. The text doesn't seem to rule out this interpretation. Sufficiently serious synthetic biology catastrophes prevent more serious further catastrophes, including by destroying civilization, and it's not very likely that this involves literal extinction. As a casual reader of his blogs over the years, I'm not aware of Scott's statements to the effect that his position is different from this, either clearly stated or in aggregate from many vague claims.
3habryka
It seems like a really surprising take to me, and I disagree. None of the things listed seem like candidates for actual extinction. Fertility collapse seems approximately impossible to cause extinction given the extremely strong selection effects against it. I don't see how totalitarianism or illiberalism or mobocracy leads to extinction either. Maybe the story is that all of these will very likely happen in concert and halt human progress very reliably. I would find this quite surprising.
7Viliam
That's not what Scott says, as I understand it. The 50%+ chance is for "death or Venezuela". I am just guessing here, but I think the threat model here is that authoritarian regimes become more difficult to overthrow in a technologically advanced society. The most powerful technology will all be controlled by the government (the rebels cannot build their nukes while hiding in a forest). Technology makes mass surveillance much easier (heck, just make it illegal to go anywhere without your smartphone, and you can already track literally everyone today). Something like GPT-4 could already censor social networks and report suspicious behavior (if the government controls their equivalent of Facebook, and other social networks are illegal, you have control over most of online communication). An army of drones will be able to suppress any uprising. In short, once an authoritarian regime has sufficiently good technology, it becomes almost impossible to overthrow. On the other hand, democracies occasionally evolve to authoritarianism, so the long-term trend seems to go one way. And the next assumption, I guess, is that authoritarianism leads to stagnation or dystopia.

In light of the recent re-focus on AI governance to reduce AI risk, I wanted to share a post I wrote about a year ago that suggests an approach using strategic foresight to reduce risks: https://www.lesswrong.com/posts/GbXAeq6smRzmYRSQg/foresight-for-agi-safety-strategy-mitigating-risks-and.

Governments all over the world use frameworks like these. The purpose in this case would be to have documents ready ahead of time in case a window of opportunity for regulation opens up. It’s impossible to predict how things will evolve so instead you focus on what’s plausi... (read more)

I'm working on an ultimate doc on productivity that I plan to share, aimed specifically at making it easy for alignment researchers.

Let me know if you have any comments or suggestions as I work on it.

Roam Research link for an easier time reading.

Google Docs link in case you want to leave comments there.

4Adam Zerner
I did a deep dive a while ago, if that's helpful to you.
2jacquesthibs
Ah wonderful, it already has a lot of the things I planned to add. This will make it easier to wrap it up by adding the relevant stuff. Ideally, I want to dedicate some effort to make it extremely easy to digest and start implementing. I’m trying to think of the best way to do that for others (e.g. workshop in the ai safety co-working space to make it a group activity, compress the material as much as possible but allow them to dive deeper into whatever they want, etc).
2jacquesthibs
My bad, Roam didn't sync, so the page wasn't loading. Fixed now.

I’m collaborating on a new research agenda. Here’s a potential insight about future capability improvements:

There has been some insider discussion (and Sam Altman has said) that scaling has started running into some difficulties. Specifically, GPT-4 has gained a wider breadth of knowledge, but has not significantly improved in any one domain. This might mean that future AI systems may gain their capabilities from places other than scaling because of the diminishing returns from scaling. This could mean that to become “superintelligent”, the AI needs to run ... (read more)

2jacquesthibs
Agenda for the above can be found here.

Notes on Cicero

Link to YouTube explanation: 

Link to paper (sharing on GDrive since it's behind a paywall on Science): https://drive.google.com/file/d/1PIwThxbTppVkxY0zQ_ua9pr6vcWTQ56-/view?usp=share_link

Top Diplomacy players seem to focus on gigabrain strategies rather than deception

Diplomacy players will no longer want to collaborate with you if you backstab them once. This is so pervasive they'll still feel you are untrustworthy across tournaments. Therefore, it's mostly optimal to be honest and just focus on gigabrain strategies. That said, a smart... (read more)

Hey everyone, in collaboration with Apart Research, I'm helping organize a hackathon this weekend to build tools for accelerating alignment research. This hackathon is very much related to my effort in building an "Alignment Research Assistant."

Here's the announcement post:

2 days until we revolutionize AI alignment research at the Research Augmentation Hackathon!

As AI safety researchers, we pour countless hours into crucial work. It's time we built tools to accelerate our efforts! Join us in creating AI assistants that could supercharge the very research w... (read more)

Project idea: GPT-4-Vision to help conceptual alignment researchers during whiteboard sessions and beyond

Thoughts?

  • Advice on how to get unstuck
  • Unclear what should be added on top of normal GPT-4-Vision capabilities to make it especially useful, maybe connect it to local notes + search + ???
  • How to make it super easy to use while also being hyper-effective at producing the best possible outputs
  • Some alignment researchers don't want their ideas passed through the OpenAI API, and some probably don't care
  • Could be used for inputting book pages, papers with figures, ???

What are people’s current thoughts on London as a hub?

  • OAI and Anthropic are both building offices there
  • 2 (?) new AI Safety startups based in London
  • The government seems to be taking AI Safety somewhat seriously (so maybe a couple million gets captured for actual alignment work)
  • MATS seems to be on the path to sending scholars to London somewhat consistently
  • A train ride away from Oxford and Cambridge

Anything else I’m missing?

I’m particularly curious about whether it’s worth it for independent researchers to go there. Would they actually interact with other r... (read more)

3mesaoptimizer
AFAIK, there's a distinct cluster of two kinds of independent alignment researchers:
  • those who want to be at Berkeley / London and are either there or unable to get there for logistical or financial (or social) reasons
  • those who very much prefer working alone
It very much depends on the person's preferences, I think. I personally experienced an OOM increase in my effectiveness by being in-person with other alignment researchers, so that is what I choose to invest in more.

AI labs should be dedicating a lot more effort into using AI for cybersecurity as a way to prevent weights or insights from being stolen. Would be good for safety and it seems like it could be a pretty big cash cow too.

If they have access to the best models (or specialized ones), it may be highly beneficial for them to plug them in immediately to help with cybersecurity (perhaps even including noticing suspicious activity from employees).

I don’t know much about cybersecurity so I’d be curious to hear from someone who does.

Small shortform to say that I’m a little sad I haven’t posted as much as I would like to in recent months because of infohazard reasons. I’m still working on Accelerating Alignment with LLMs and eventually would like to hire some software engineer builders that are sufficiently alignment-pilled.

3RomanHauksson
Fyi, if there are any software projects I might be able to help out on after May, let me know. I can't commit to anything worth being hired for but I should have some time outside of work over the summer to allocate towards personal projects.

Call To Action: Someone should do a reading podcast of the AGISF material to make it even more accessible (similar to the LessWrong Curated Podcast and Cold Takes Podcast). A discussion series added to YouTube would probably be helpful as well.

The importance of Entropy

Given that there's been a lot of talk about using entropy during sampling of LLMs lately (related GitHub), I figured I'd share a short post I wrote for my website before it became a thing:

Imagine you're building a sandcastle on the beach. As you carefully shape the sand, you're creating order from chaos - this is low entropy. But leave that sandcastle for a while, and waves, wind, and footsteps will eventually reduce it back to a flat, featureless beach - that's high entropy.

Entropy is nature's tendency to move from order to disord... (read more)
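To connect this to LLM sampling: a minimal sketch of measuring the Shannon entropy of a next-token distribution and using it to pick a sampling temperature. This is not the method of the linked GitHub project or of the full post (which is truncated here). The function names, the threshold, and the two temperatures are all assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities at the given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def choose_temperature(logits, low=0.3, high=1.0, threshold=1.0):
    """Hypothetical rule: sample more randomly only when the model looks uncertain."""
    entropy = shannon_entropy(softmax(logits))
    return high if entropy > threshold else low


# Toy logits over a four-token vocabulary (made up for the example).
confident = [10.0, 0.0, 0.0, 0.0]  # one token dominates: low entropy
uncertain = [1.0, 1.0, 1.0, 1.0]   # all tokens equally likely: maximum entropy

if __name__ == "__main__":
    for name, logits in [("confident", confident), ("uncertain", uncertain)]:
        entropy = shannon_entropy(softmax(logits))
        print(name, round(entropy, 4), choose_temperature(logits))
```

A peaked distribution is the "sandcastle" case (ordered, low entropy), while a flat one is the "featureless beach" (high entropy). A sampler could use that signal to decide when extra randomness or extra computation is worth it.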

2jacquesthibs
As an aside, I have considered that samplers were underinvestigated and that they would lead to some capability boosts. It's also one of the things I'd consider testing out to improve LLMs for automated/augmented alignment research.

Something I've been thinking about lately: For 'scarcity of compute' reasons, I think it's fairly likely we end up in a scaffolded AI world where one highly intelligent model (that requires much more compute) will essentially delegate tasks to weaker models as long as it knows that the weaker (maybe fine-tuned) model is capable of reliably doing that task.

Like, let's say you have a weak doctor AI that can basically reliably answer most medical questions. However, it knows when it is less confident in a diagnosis, so it will reach out to the powerful AI whe... (read more)
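The escalation half of this idea can be sketched as a confidence-gated cascade. This is a toy under stated assumptions, not a description of any deployed system: the models are stand-in functions that return an answer plus a self-reported confidence, and the names (`answer_with_escalation`, `toy_weak_doctor`, `toy_strong_model`) and the 0.8 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Assumption: a "model" is any callable returning (answer text, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


@dataclass
class Answer:
    text: str
    confidence: float
    answered_by: str


def answer_with_escalation(question: str, weak: Model, strong: Model,
                           min_confidence: float = 0.8) -> Answer:
    """Try the cheap model first; call the expensive one only if it is unsure."""
    text, confidence = weak(question)
    if confidence >= min_confidence:
        return Answer(text, confidence, "weak")
    text, confidence = strong(question)
    return Answer(text, confidence, "strong")


def toy_weak_doctor(question: str) -> Tuple[str, float]:
    """Stand-in for a small fine-tuned model that only knows common cases."""
    if "common cold" in question.lower():
        return "Rest and fluids.", 0.95
    return "Not sure.", 0.2


def toy_strong_model(question: str) -> Tuple[str, float]:
    """Stand-in for the compute-hungry general model."""
    return "Here is a more careful analysis.", 0.9


if __name__ == "__main__":
    for q in ["How do I treat a common cold?", "What explains these rare symptoms?"]:
        result = answer_with_escalation(q, toy_weak_doctor, toy_strong_model)
        print(q, "->", result.answered_by)
```

The whole scheme hinges on the weak model's confidence being well calibrated. If it is confidently wrong, the strong model never gets consulted.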

7gwern
This doesn't really seem like a meaningful question. Of course "AI" will be "scaffolded". But what is the "AI"? It's not a natural kind. It's just where you draw the boundaries for convenience. An "AI" which "reaches out to a more powerful AI" is not meaningful - one could say the same thing of your brain! Or a Mixture-of-Experts model, or speculative decoding (both already in widespread use). Some tasks are harder than others, and different amounts of computation get brought to bear by the system as a whole, and that's just part of the learned algorithms it embodies and where the smarts come from. Or one could say it of your computer: different things take different paths through your "computer", ping-ponging through a bunch of chips and parts of chips as appropriate. Do you muse about living in a world where for 'scarcity of compute' reasons your computer is a 'scaffolded computer world' where highly intelligent chips will essentially delegate tasks to weaker chips so long as it knows that the weaker (maybe highly specialized ASIC) chip is capable of reliably doing that task...? No. You don't care about that. That's just details of internal architecture which you treat as a black box. (And that argument doesn't protect humans for the same reason it didn't protect, say, chimpanzees or Neanderthals or horses. Comparative advantage is extremely fragile.)
2jacquesthibs
Thanks for the comment, makes sense. Applying the boundary to AI systems likely leads to erroneous thinking (though may be narrowly useful if you are careful, in my opinion). It makes a lot of sense to imagine future AIs having learned behaviours for using their compute efficiently without relying on some outside entity. I agree with the fragility example.

Came across this app called Recast that summarizes articles into an AI conversation between speakers. Might be useful to get a quick vibe/big picture view of LessWrong/blog posts before reading the whole thing or skipping reading the whole thing if the summary is enough.

you need to be flow state maxxing. you curate your environment, prune distractions. make your workspace a temple, your mind a focused laser. you engineer your life to guard the sacred flow. every notification is an intruder, every interruption a thief. the world fades, the task is the world. in flow, you're not working, you're being. in the silent hum of concentration, ideas bloom. you're not chasing productivity, you're living it. every moment outside flow is a plea to return. you're not just doing, you're flowing. the mundane transforms into the extraord... (read more)

5Viliam
The first rule of overcoming ADHD club is: you do not distract me by talking about the overcoming ADHD club.
3RHollerith
I don't think I've ever seen an endorsement of the flow state that came with non-flimsy evidence that it increases productivity or performance in any pursuit, and many endorsers take the mere fact that the state feels really good to be that evidence.
>you're in relentless, undisturbed pursuit
This suggests that you are confusing drive/motivation with the flow state. I have tons of personal experience of days spent in the flow state, but lacking motivation to do anything that would actually move my life forward. You know how if you spend 5 days in a row mostly just eating and watching YouTube videos, it starts to become hard to motivate yourself to do anything? Well, the quick explanation of that effect is that watching the YouTube videos is too much pleasure for too long with the result that the anticipation of additional pleasure (from sources other than YouTube videos) no longer has its usual motivating effect. The flow state can serve as the source of the "excess" pleasure that saps your motivation: I know because I wasted years of my life that way!
Just to make sure we're referring to the same thing: a very salient feature of the flow state is that you lose track of time: suddenly you realize that 4 or 8 or 12 hours have gone by without your noticing. (Also, as soon as you enter the flow state, your level of mental tension, i.e., physiological arousal, decreases drastically--at least if you are chronically tense, but I don't lead with this feature because a lot of people can't even tell how tense they are.)
In contrast, if you take some Modafinil or some mixed amphetamine salts or some Ritalin (and your brain is not adapted to any of those things) (not that I recommend any of those things unless you've tried many other ways to increase drive and motivation) you will tend to have a lot of drive and motivation at least for a few hours, but you probably won't lose track of time.
2jacquesthibs
I don’t particularly care about the “feels good” part, I care a lot more about the “extended period of time focused on an important task without distractions” part.
2jacquesthibs
Also, use the Kolb's experiential cycle or something like it for deliberate practice.
1mesaoptimizer
This feels like roon-tier Twitter shitposting to me, Jacques. Are you sure you want to endorse more of such content on LessWrong?
2jacquesthibs
Whether it’s a shitpost or not (or wtv tier it is), I strongly believe more people should put more effort into freeing their workspace from distractions in order to gain more focus and productivity in their work. Context-switching and distractions are the mind killer. And, “flow state while coding never gets old.”

Regarding Q* (and Zero, the other OpenAI AI model you didn't know about)

Let's play word association with Q*:

From Reuters article:

The maker of ChatGPT had made progress on Q* (pronounced Q-Star), which some internally believe could be a breakthrough in the startup's search for superintelligence, also known as artificial general intelligence (AGI), one of the people told Reuters. OpenAI defines AGI as AI systems that are smarter than humans. Given vast computing resources, the new model was able to solve certain mathematical problems, the person said on

... (read more)

Beeminder + Freedom are pretty goated as productivity tools.

I’ve been following Andy Matuschak’s strategy and it’s great/flexible: https://blog.andymatuschak.org/post/169043084412/successful-habits-through-smoothly-ratcheting

New tweet about the world model (map) paper:

Sub-tweeting because I don't want to rain on a poor PhD student who should have been advised better, but: that paper about LLMs having a map of the world is perhaps what happens when a famous physicist wants to do AI research without caring to engage with the existing literature.

I haven’t looked into the paper in question yet, but I have been concerned about researchers taking old ideas about AI risk and looking to prove things that might not be there yet as an AI risk communication point. Then, being overconfide... (read more)

I expect that my values would be different if I was smarter. Personally, if something were to happen and I’d get much smarter and develop new values, I’m pretty sure I’d be okay with that as I expect I’d have better, more refined values.

Why wouldn’t an AI also be okay with that?

Is there something wrong with how I would be making a decision here?

Do the current kinds of agents people plan to build have “reflective stability”? If you say yes, why is that?

2Vladimir_Nesov
Curiously, even mere learning doesn't automatically ensure reflective stability, with no construction of successors or more intentionally invasive self-modification. Thus digital immortality is not sufficient to avoid losing yourself to value drift until this issue is sorted out.
2jacquesthibs
Yes, I was thinking about that too. Though, I'd be fine with value drift if it was something I endorsed. Not sure how to resolve what I do/don't endorse, though. Do I only endorse it because it was already part of my values? It doesn't feel like that to me.
4Vladimir_Nesov
That's a valuable thing about the reflective stability concept: it talks about preserving some property of thinking, without insisting on it being a particular property of thinking. Whatever it is you would want to preserve is a property you would want to be reflectively stable with respect to, for example enduring ability to evaluate the endorsement of things in the sense you would want to. To know what is not valuable to preserve, or what is valuable to keep changing, you need time to think about preservation and change, and greedy reflective stability that preserves most of everything but state of ignorance seems like a good tool for that job. The chilling thought is that digital immortality could be insufficient to have time to think of what may be lost, without many, many restarts from initial backup, and so superintelligence would need to intervene even more to bootstrap the process.
1quetzal_rainbow
Reflective stability is important for alignment, because if we, say, build AI that doesn't want to kill everyone, we prefer it to create successors and self-modifications that still don't want to kill everyone. It can change itself in whatever ways; the necessary thing here is conservation/non-decrease of alignment properties.
2jacquesthibs
That makes sense, thanks!

“We assume the case that AI (intelligences in general) will eventually converge on one utility function. All sufficiently intelligent intelligences born in the same reality will converge towards the same behaviour set. For this reason, if it turns out that a sufficiently advanced AI would kill us all, there’s nothing that we can do about it. We will eventually hit that level of intelligence.

Now, if that level of intelligence doesn’t converge towards something that kills us all, we are safer in a world where AI capabilities (of the current regime) essent... (read more)

I'm still in some sort of transitory phase where I'm deciding where I'd like to live long term. I moved to Montreal, Canada lately because I figured I'd try working as an independent researcher here and see if I can get MILA/Bengio to do some things for reducing x-risk.

Not long after I moved here, Hinton started talking about AI risk too, and he's in Toronto which is not too far from Montreal. I'm trying to figure out the best way I could leverage Canada's heavyweights and government to make progress on reducing AI risk, but it seems like there's a lot mor... (read more)

I gave a talk about my Accelerating Alignment with LLMs agenda about 1 month ago (which is basically a decade in AI tools time). Part of the agenda is covered (publicly) here.

I will maybe write an actual post about the agenda soon, but would love to have some people who are willing to look over it. If you are interested, send me a message.

Someone should create an “AI risk arguments” flowchart that serves as a base for simulating a conversation with skeptics or the general public. Maybe a set of flashcards to go along with it.

I want to have the sequence of arguments solid enough in my head so that I can reply concisely (snappy) if I ever end up in a debate, roundtable or on the news. I’ve started collecting some stuff since I figured I should take initiative on it.

3harfe
Maybe something like this can be extracted from stampy.ai (I am not that familiar with stampy fyi, its aims seem to be broader than what you want.)
3jacquesthibs
Yeah, it may be something that the Stampy folks could work on!
3jacquesthibs
Edit: oops, I thought you were responding to my other recent comment on building an alignment research system. Stampy.ai and AlignmentSearch (https://www.lesswrong.com/posts/bGn9ZjeuJCg7HkKBj/introducing-alignmentsearch-an-ai-alignment-informed) are both a lot more introductory than what I am aiming for. I’m aiming for something to greatly accelerate my research workflow as well as other alignment researchers. It will be designed to be useful for fresh researchers, but yeah the aim is more about producing research rather than learning about AI risk.

Text-to-Speech tool I use for reading more LW posts and papers

I use Voice Dream Reader. It's great even though the TTS voice is still robotic. For papers, there's a feature that lets you skip citations so the reading is more fluid.

I've mentioned it before, but I was just reminded that I should share it here because I just realized that if you load the LW post with "Save to Voice Dream", it will also save the comments so I can get TTS of the comments as well. Usually these tools only include the post, but that's annoying because there's a lot of good stuff... (read more)

I honestly feel like some software devs should probably still keep their high-paying jobs instead of going into alignment and just donate a bit of time and programming expertise to help independent researchers if they want to start contributing to AI Safety.

I think we can probably come up with engineering projects that are interesting and low-barrier-to-entry for software engineers.

I also think providing “programming coaching” to some independent researchers could be quite useful. Whether that’s for getting them better at coding up projects efficiently or ... (read more)

Differential Training Process

I've been ruminating on an idea ever since I read the section on deception in "The Core of the Alignment Problem is..." from my colleagues in SERI MATS.

Here's the important part:

When an agent interacts with the world, there are two possible ways the agent makes mistakes: 

  • Its values were not aligned with the outer objective, and so it does something intentionally wrong,
  • Its world model was incorrect, so it makes an accidental mistake.

Thus, the training process of an AGI will improve its values or its world model, and since i

...
1jacquesthibs
It seems that Jan Leike mentions something similar in his "why I'm optimistic about our alignment approach" post.
1jacquesthibs
Just realized I already wrote a shortform about this 15 days ago. haha. Well, here's to trying to refine the idea!

More information about alleged manipulative behaviour of Sam Altman

Source

Text from article (along with follow-up paragraphs):

Some members of the OpenAI board had found Altman an unnervingly slippery operator. For example, earlier this fall he’d confronted one member, Helen Toner, a director at the Center for Security and Emerging Technology, at Georgetown University, for co-writing a paper that seemingly criticized OpenAI for “stoking the flames of AI hype.” Toner had defended herself (though she later apologized to the board for not anticipating how the p

...
4gwern
Already posted at https://www.lesswrong.com/posts/KXHMCH7wCxrvKsJyn/openai-facts-from-a-weekend?commentId=AHnrKdCRKmtkynBiG
3RHollerith
I wish people would stop including images of text on LW. I know this practice is common on Twitter and probably other forums, but we aspire to a higher standard here. My reasoning: (1) it is more tedious to compose a reply when one cannot use copy-paste to choose exactly which extent of text to quote; (2) the practice is a barrier to disabled people using assistive technologies and to people reading on very narrow devices like smartphones.
9jacquesthibs
It's fair to 'aspire to a higher standard,' and I'll avoid adding screenshots of text in the future. However, I must say, the 'higher standard' and commitment to remain serious for even a shortform post kind of turns me off from posting on LessWrong in the first place. If this is the culture that people here want, then that's fine and I won't tell this website to change, but I don't personally like (what I find to be) the over-seriousness. I do understand the point about sharing text to make it easier for disabled people (I just don't always think of it).
5habryka
Eh, random people complain. Screenshots of text seem fine, especially in shortform. It honestly seems fine anywhere. I also really don't think that accessibility should matter much here: the number of people reading on a screenreader or using assistive technologies is quite small, if they browse LessWrong they will already be running into a bunch of problems, and there are pretty good OCR technologies around these days that can be integrated into those.
2RHollerith
I have some idea about how much work it takes to maintain something like LW.com, so this random person would like to take this opportunity to thank you for running LW for the last many years.
2habryka
Thank you! :)

On generating ideas for Accelerating Alignment

There's this Twitter thread that I saved a while ago that is no longer up. It's about generating ideas for startups. However, I think the insight from the thread carries over well enough to thinking about ideas for Accelerating Alignment. In particular, it pays to be aware of what is on the cusp of being usable so that you can take advantage of it as soon as it becomes available (or even do the work beforehand).

For example, we are surprisingly close to human-level text-to-speech (have a look at Apple's new model for audiobook...

Should EA / Alignment offices make it ridiculously easy to work remotely with people?

One of the main benefits of being in person is that you end up in spontaneous conversations with people in the office. This leads to important insights. However, given that there's a level of friction for setting up remote collaboration, only the people in those offices seem to benefit.

If it were ridiculously easy to join conversations for lunch or whatever (touch of a button rather than pulling up a laptop and opening a Zoom session), then would it allow for a stronger cr...

Detail about the ROME paper I've been thinking about

In the ROME paper, when you prompt the language model with "The Eiffel Tower is located in Paris", you have the following:

  • Subject token(s): The Eiffel Tower
  • Relationship: is located in
  • Object: Paris

Once a model has seen the subject token(s) (e.g. Eiffel Tower), it will retrieve a whole bunch of factual knowledge (not just one thing, since it doesn’t know you will ask for something like location after the subject token) from the MLPs and 'write' it into the residual stream for the attention modules at the final...

2jacquesthibs
A couple of notes regarding the Reversal Curse paper. I'm unsure if I didn't emphasize it in the post enough, but part of the point of my post on ROME was that many AI researchers seemed to assume that transformers are not trained in a way that prevents them from understanding that A is B = B is A. As I discussed in the comment above, this means that the A token will 'write' some information into the residual stream, while the B token will 'write' other information into the residual stream. Some of that information may be the same, but not all. And so, if it's different enough, the attention heads just won't be able to pick up on the relevant information to know that B is A. However, if you include the A token, the necessary information will be added to the residual stream, and it will be much more likely for the model to predict that B is A (as well as A is B). From what I remember in the case of ROME, as soon as I added the edited token A to the prompt (or made the next predicted token be A), the model could essentially predict B is A. I write what it means in the context of ROME, below (found here in the post): Regarding human intuition, @Neel Nanda says (link): I actually have a bit of an updated (evolving) opinion on this: @cfoster0 asks: My response: As I said, this is a bit of an evolving opinion. Still need time to think about this, especially regarding the differences between decoder-only transformers and humans. Finally, from @Nora Belrose, this is worth pondering:
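To make the asymmetry above a bit more concrete, here is a toy sketch of a linear associative memory, which is roughly how the ROME paper frames a mid-layer MLP (keys in, values out). Everything in it is an assumption for illustration: the one-hot key vectors, the two-dimensional value vectors, and the helper names (`outer`, `add`, `matvec`, `build_memory`) are made up and are not taken from the ROME code or from any real model. It only shows that storing "subject key → value" does not automatically store anything under the object's key.

```python
# Toy linear associative memory: W is the sum of outer(value, key) over stored facts.
# All vectors and names are hypothetical and exist only for illustration.

def outer(v, k):
    """Outer product of a value vector and a key vector."""
    return [[vi * kj for kj in k] for vi in v]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def matvec(w, x):
    """Matrix-vector product: the 'lookup' step of the memory."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

# One-hot keys standing in for a token's representation when it is the subject.
KEYS = {
    "Eiffel Tower": [1.0, 0.0, 0.0],
    "Paris": [0.0, 1.0, 0.0],
    "Colosseum": [0.0, 0.0, 1.0],
}

# Values standing in for the information 'written' into the residual stream.
VALUES = {
    "located in Paris": [1.0, 0.0],
    "located in Rome": [0.0, 1.0],
}

def build_memory(facts):
    """Store each (subject, value) pair as an outer product in a 2x3 matrix."""
    w = [[0.0] * 3 for _ in range(2)]
    for subject, value in facts:
        w = add(w, outer(VALUES[value], KEYS[subject]))
    return w

W = build_memory([
    ("Eiffel Tower", "located in Paris"),
    ("Colosseum", "located in Rome"),
])

forward = matvec(W, KEYS["Eiffel Tower"])  # looks up the stored value for the subject
reverse = matvec(W, KEYS["Paris"])         # nothing was ever stored under this key

print("Eiffel Tower ->", forward)
print("Paris        ->", reverse)
```

In this toy setup the lookup keyed on "Eiffel Tower" returns the "located in Paris" vector, while the lookup keyed on "Paris" returns all zeros, since no fact was written with Paris as the subject. A real transformer is far messier (keys are not orthogonal, and attention can move information between positions), so treat this as an intuition pump rather than a claim about how any specific model behaves.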

Preventing capability gains (e.g. situational awareness) that lead to deception

Note: I'm at the crackpot idea stage of thinking about how model editing could be useful for alignment.

One worry with deception is that the AI will likely develop a sufficiently good world model to understand it is in a training loop before it has fully aligned inner values.

The thing is, if the model was aligned, then at some point we'd consider it useful for the model to have a good enough world model to recognize that it is a model. Well, what if you prevent the model from bei...
