jacob_cannell

I have a compute-market startup called vast.ai, and I'm working towards aligned AI. Currently seeking networking, collaborators, and hires - especially top-notch CUDA/GPU programmers.

My personal blog: https://entersingularity.wordpress.com/

Comments

How is that even remotely relevant? Humans and AIs learn the same way, via language - and it's not as if this learning process fails just because language undersamples thoughts.

As the article points out, shared biological needs do not much deter the bear or chimpanzee from killing you. An AI could be perfectly human - the very opposite of alien - and far more dangerous than Hitler or Dahmer.

The article is well written but dangerously wrong in its core point. AI will be far more human than alien. But alignment/altruism is mostly orthogonal to human vs alien.

We are definitely not training AIs on human thoughts because language is an expression of thought, not thought itself.

Even if training on language was not equivalent to training on thoughts, that would also apply to humans.

But it also seems false in the same way that "we are definitely not training AIs on reality because image files are compressed sampled expressions of images, not reality itself" is false.

Approximate Bayesian inference (i.e. DL) can infer the structure of a function through its outputs, the structure of the 3D world through images, and thoughts through language.

Premise 1: AGIs would be like a second advanced species on earth, more powerful than humans.

Distinct alien species arise only from distinct, separated evolutionary histories. Your example of the aliens from Arrival is indeed a good (hypothetical) example of truly alien minds resulting from a completely independent evolutionary history on an alien world. Any commonalities between us and them would be solely the result of convergent evolutionary features. They would have completely different languages, cultures, etc.

AI is not alien at all, as we literally train AI on human thoughts. As a result we constrain our AI systems profoundly, creating them in our mental image. Any AGI we create will inevitably be far closer to human uploads than alien minds. This is a prediction Moravec made as early as 1988 (Mind Children) - now largely fulfilled by the strong circuit convergence/correspondence between modern AI and brains.

Minds are software mental constructs, and alien minds would require alien culture. Instead we are simply creating new hardware for our existing (cultural) mind software.

I'm also not sure of the relevance and am not following the thread fully, but the summary of that experiment is that it takes some time (measured in nights of sleep, which are the rough equivalent of big batch training updates) for the newly sighted to develop vision, but less time than infants - presumably because the newly sighted already have fully functioning sensor inference world models in another modality that can speed up learning through dense top-down priors.

But it's way, way more than "grok it really fast with just a few examples" - training their new visual systems still takes non-trivial training data & time.

I suspect that much of the appeal of shard theory is working through detailed explanations of model-free RL with general value function approximation for people who mostly think of AI in terms of planning/search/consequentialism.

But if you already come from a model-free RL value approx perspective, shard theory seems more natural.

Moment to moment decisions are made based on value-function bids, with little to no direct connection to reward or terminal values. The 'shards' are just what learned value-function approximating subcircuits look like in gory detail.

The brain may have a prior towards planning subcircuitry, but even without a strong prior, planning submodules will eventually emerge naturally in a model-free RL learning machine of sufficient scale (there is no fundamental difference between model-free and model-based for universal learners). TD-like updates ensure that the value function extends over longer timescales as training progresses. (And in general humans seem to plan on timescales which scale with their lifespan, as you'd expect.)
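To make the TD point concrete, here is a minimal sketch (my own toy setup, not anything from the shard theory posts): tabular TD(0) on a deterministic chain where the only reward arrives on the final step. The function names `td0_chain` and `horizon`, the chain length, and the values of `gamma` and `alpha` are all illustrative assumptions.

```python
def td0_chain(n_states=10, sweeps=3, gamma=0.9, alpha=1.0):
    """Tabular TD(0) on a deterministic left-to-right chain.

    Reward is 1.0 only on the final transition into the terminal state.
    All parameter values are illustrative assumptions.
    """
    values = [0.0] * (n_states + 1)  # extra slot is the terminal state (value 0)
    for _ in range(sweeps):          # one sweep = one episode walked start to end
        for s in range(n_states):
            reward = 1.0 if s == n_states - 1 else 0.0
            td_target = reward + gamma * values[s + 1]
            values[s] += alpha * (td_target - values[s])
    return values[:n_states]


def horizon(values):
    """Number of states whose value estimate already reflects the final reward."""
    return sum(1 for v in values if v > 0.0)


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(k, horizon(td0_chain(sweeps=k)))
```

In this toy case the reward signal propagates back one state per episode, so the effective horizon of the value function grows with the amount of training. Real systems with function approximation will not be this tidy, but the direction of the effect is the same.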

TSMC 4N is a little over 1e10 transistors/cm^2 for GPUs and roughly 5e-18 J switch energy assuming dense activity (little dark silicon). The practical transistor density limit with minimal few-electron transistors is somewhere around 5e11 trans/cm^2, but the minimal viable high-speed switching energy is around 2e-18 J. So there is another 1 to 2 OOM of further density scaling, but less room for further switching energy reduction. Thus scaling past this point increasingly involves dark silicon or complex, expensive cooling, and thus diminishing returns either way.
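As a rough check on those numbers, the headroom can be worked out directly. This is only back-of-envelope arithmetic on the figures quoted above; the variable names are mine, and the power density line assumes a fixed clock rate and activity factor, which is a simplification.

```python
import math

# Approximate figures quoted in the comment above (not measured here).
current_density = 1e10     # transistors/cm^2, TSMC 4N GPUs
limit_density = 5e11       # transistors/cm^2, practical few-electron limit
current_switch_j = 5e-18   # J per switch, assuming dense activity
limit_switch_j = 2e-18     # J per switch, minimal viable at high speed

# Remaining density scaling in orders of magnitude.
density_headroom_oom = math.log10(limit_density / current_density)

# Remaining switching energy reduction as a plain factor.
energy_headroom_x = current_switch_j / limit_switch_j

# Assumption: same clock and activity factor, so power per cm^2 scales as
# (transistors per cm^2) * (energy per switch).
relative_power_density = (limit_density / current_density) * (
    limit_switch_j / current_switch_j
)

if __name__ == "__main__":
    print(f"density headroom: {density_headroom_oom:.2f} OOM")
    print(f"switch energy headroom: {energy_headroom_x:.1f}x")
    print(f"power density at the limit: {relative_power_density:.0f}x today")
```

Under those assumptions density has roughly 1.7 OOM left while switching energy has only about 2.5x, so a fully active chip at the density limit would dissipate on the order of 20x more power per cm^2 than today. That gap is what forces the choice between dark silicon and more aggressive cooling.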

Achieving 1e-15 J/flop seems doable now for low precision flops (fp4, perhaps fp8 with some tricks/tradeoffs); most of the cost is data movement, as pulling even a single bit from RAM just 1 cm away costs around 1e-12 J.
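To put the data movement cost in perspective, here is a small sketch using only the two energy figures above. The helper `flops_needed_per_fetch` and its `movement_share` parameter are hypothetical names I'm introducing for illustration; the model ignores caches, wire length variation, and everything else that matters in a real design.

```python
FLOP_ENERGY_J = 1e-15      # low precision flop, figure quoted above
BIT_MOVE_ENERGY_J = 1e-12  # one bit pulled from RAM ~1 cm away, figure quoted above


def flops_per_bit_moved():
    """How many flops cost the same energy as moving a single bit."""
    return BIT_MOVE_ENERGY_J / FLOP_ENERGY_J


def flops_needed_per_fetch(bits_fetched, movement_share=0.5):
    """Flops to perform per fetch so movement is `movement_share` of total energy.

    Simplified model (an assumption): total energy = movement + compute.
    """
    move_j = bits_fetched * BIT_MOVE_ENERGY_J
    compute_j = move_j * (1.0 - movement_share) / movement_share
    return compute_j / FLOP_ENERGY_J


if __name__ == "__main__":
    print(flops_per_bit_moved())
    print(flops_needed_per_fetch(bits_fetched=4))  # one fp4 operand
```

With these figures a single bit moved costs about as much as a thousand flops, so each fp4 operand fetched from that distance needs to be reused across roughly 4000 flops just for compute to reach half of the energy budget.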

Part of the issue is my post/comment was about Moore's law (transistor density for mass-produced nodes), which is a major input to but distinct from flops/$. As I mentioned somewhere, there is still some free optimization energy in extracting more flops/$ at the circuit level even if Moore's law ends. Moore's law is very specifically about fab efficiency as measured in transistors/cm^2 for large chip runs - not the flops/$ habryka wanted to bet on. Even when Moore's law is over, I expect some continued progress in flops/$.

All that being said, Nvidia's new flagship GPU everyone is using - the H100, which is replacing the A100 and launched just a bit after habryka proposed the bet - actually offers near zero improvement in flops/$ (the price increased in direct proportion to the flops increase). So I probably should have taken the bet if it was narrowly defined as (flops/$ for the flagship GPUs most teams are currently using for training foundation models).

I don't know who first said it, but the popular saying "Computer vision is the inverse of computer graphics" encompasses much of this viewpoint.

Computer graphics is the study/art of the approximation theory you mention, and it is fairly well developed & understood in terms of how to best simulate worlds & observations in real-time from the perspective of an observer. But of course traditional graphics uses human-designed world models and algorithms.

Diffusion models provide a general framework for learning a generative model in the other direction - in part by inverting trained vision and noise models.

So naturally there is also diffusion planning which is an example of the symmetry you discuss: using general diffusion inference for planning. The graph dimensions end up being both space-time and abstraction level with the latter being more important: sensor inference moves up the abstraction/compression hierarchy, whereas planning/acting/generating moves down.

Even if there is no acceptable way to share the data semi-anonymously outside of Match Group, the arguments for prediction markets still apply within Match Group. A well-designed prediction market would still be a better way to distribute internal resources and rewards amongst competing data science teams within Match Group.

But I'm skeptical that the value of Match Group's private data is dominant even in the fully private data scenario. People who actually match and meet up with another user will probably have important inside view information inaccessible to the algorithms of Match Group.

Manifold.Love's lack of success is hardly much evidence against the utility of prediction markets for dating markets, any more or less than most startups' failures at X are evidence against the utility of X.
