A hypothetical but typical example: it tries to use /usr/bin/python because it has memorized that as the path to Python. That fails, so it concludes it must create that path, which would require sudo permissions, and if it can get them it could potentially mess something up.
Not running amok, just not reliably following instructions like "only modify files in this folder" or "don't install pip packages". Claude follows instructions correctly; some other models are mode-collapsed into a certain way of doing things. E.g. gpt-4o always thinks it's running Python in the ChatGPT code interpreter, and you need very strong prompting to make it behave in a way specific to your computer.
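For what it's worth, an instruction like "only modify files in this folder" can also be checked outside the model instead of relying on prompting alone. A minimal sketch, assuming the agent's file writes go through a wrapper you control; `is_inside` is a hypothetical helper name, not part of any agent framework:

```python
from pathlib import Path


def is_inside(workdir: str, target: str) -> bool:
    """Return True if `target` resolves to a location inside `workdir`.

    Hypothetical helper: relative targets are interpreted against workdir,
    absolute targets are taken as-is, and `..` segments and symlinks are
    resolved before comparing.
    """
    root = Path(workdir).resolve()
    candidate = (root / target).resolve()
    return candidate == root or root in candidate.parents


# A write wrapper could refuse anything that fails this check.
print(is_inside("/tmp/proj", "src/main.py"))
print(is_inside("/tmp/proj", "../elsewhere/file.txt"))
print(is_inside("/tmp/proj", "/usr/bin/python"))
```

This only covers file paths; it does nothing about something like "don't install pip packages", which would need its own check.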
I've recently done more with AI agents running amok, and I've found Claude was actually more aligned: it did stuff I asked it not to do much less often than OpenAI models, enough that it actually made a difference lol
I'd guess effort at Google/banks is more leveraged than demos, if you're only considering harm from scams and not general AI slowdown and risk.
Working on anti-spam/scam features at Google or banks could be a leveraged intervention under some worldviews. As AI advances it will be more difficult for most people to avoid getting scammed, and building really great protections into popular messaging platforms and banks could redistribute a lot of money from AIs to humans.
Like the post! I'm very interested in how the capabilities of prediction vs character are changing with more recent models. E.g. the new Sonnet may have more of its capabilities tied to its character. And reasoning models maybe have a fourth layer between ground and character, possibly even completely replacing the ground layer in highly distilled models.
there is https://shop.nist.gov/ccrz__ProductList?categoryId=a0l3d0000005KqSAAU&cclcl=en_US which fulfils some of this
Wow, thank you for replying so fast! I donated $5k just now, mainly because you reminded me that Lightcone may not meet goal 1, and that's definitely worth meeting.
About web design, I'm only slightly persuaded by your response. In the example of Twitter, I don't really buy that there's public evidence that Twitter's website work, besides user-invisible algorithm changes, has had much impact. I only use the Following page, and don't use spaces, lists, voice, or anything else on Twitter. Comparing Twitter with Bluesky/Threads/whatever, it really looks to me like cultural stuff, moderation, and advertisement are the meat, not the sites. Something like StackOverflow has more complexity that actually impacts the website in some way (there is lots of implicit complexity in tweet reply trees and social groups, but that only impacts the website through user-invisible algorithms). And a core part of my model is that recommendation algorithms have a much lower ceiling for LessWrong because it doesn't have enough data volume. Like, I don't expect to miss stuff I really wanted to see on LW; reading the titles of most posts isn't hard (I also have people recommend posts in person, which helps...). Maybe in my model StackOverflow is at the ceiling of web dev leveraged-ness, because there is enough volume of posts written by quality people who can be nudged to spend a little more time on quality, and the posts can be sorted through, or something (vague thought).
When I look at LessWrong, it seems extremely bottlenecked on post quality. I think having the best AIs (o3 when it comes out might help significantly) help write and improve the core content of posts might make a big difference. I would bet that interventions that don't route through more effort/intelligence/knowledge going into writing main posts wouldn't make me like LessWrong much more.
My main crux about how valuable Lightcone donations are is how impactful great web dev on LessWrong is. If I look around, the impact of websites doesn't look strongly correlated with web design, especially at the very high end. My model is more like: platforms / social networks rise or fall by zeitgeist, moderation, big influencers/campaigns (e.g. Elon Musk for Twitter), and web design, in that order. Olli has thought about this much more than me, maybe he's right. I certainly don't believe there's a good argument that LW web dev is responsible for its user metrics. Zeitgeist, moderation, and Lightcone people personally posting seem likely more important to me. Lightcone is still great despite my (uninformed) disagreement!
Interested to see evaluations on tasks not selected to be reward-hackable, and attempts to make performance closer to competitive with standard RL.