BrianTan

I'm the co-founder and operations director of WhiteBox Research. WhiteBox aims to develop more AI interpretability and safety researchers in Asia. I'm also a co-founder of EA Philippines.

I was previously a Group Support Contractor for the Centre for Effective Altruism (CEA) for two years, where I helped support EA groups around the world.

You can reach out to me at brian@whiteboxresearch.org or find me on LinkedIn.


Comments

BrianTan184

Thanks for linking these! I also want to highlight that Sam shared his AGI timeline in the Bloomberg interview: "I think AGI will probably get developed during this president’s term, and getting that right seems really important."

BrianTan*86

My typo reaction may have glitched, but I think you meant "Don't push the frontier of capabilities" in the last bullet?

I've only read the blog post and a bit of the paper so far, but do you plan to investigate how to remove alignment faking in these situations? I wonder if there are simple methods to do so without negatively affecting the model's capabilities and safety.

Thanks for doing this important research! I may have found 2 minor typos:

  1. The abstract says "We find the model complies with harmful queries from free users 14% of the time", but other places say 12%. Should it be 12%?
  2. In the blog post, "sabotage evaluations" seems to point to a private link.

Thanks for this analysis! A minor note: you're probably aware of this, but OpenPhil funds a lot of technical AI safety field-building work as part of their "Global Catastrophic Risks Capacity Building" grants. So the proportion of field-building / talent-development grants would be significantly higher if those were included.

BrianTan20

Thanks for making this! This is minor, but I think the total should be $189M and not $169M?

Your last sentence in the first paragraph seems to be cut off at "gets a lot more than"!

I'm following up on Leon's question: have the results been posted yet? If not, will they be posted, and when? I'm curious to know. Thanks!

Thanks for this. This tweet from Dr. Jacob Glanville, founder and CEO of Centivax, makes me worried about this variant too:

The new B.1.1.529 strain out of South Africa has 15 mutations in the RBD where majority of neutralizing antibodies bind. The current vaccines and even Delta-based vaccines probably won’t work against this new strain. Swift, vigorous containment is needed.
