TL;DR: We report our intermediate results from the AI Safety Camp project “Mechanistic Interpretability Via Learning Differential Equations”. Our goal was to explore transformers that process time-series numerical data (either inferring the governing differential equation or predicting the next number). Because this task is well formalized, it appears to be an easier problem than interpreting a transformer that deals with language. During the project we applied various interpretability methods to the problem at hand, and we obtained some preliminary results (e.g., we observed a pattern resembling a numerical computation of the derivative of the input data). We plan to continue working on it to validate and extend these preliminary results.
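To make the derivative observation concrete, the following is a minimal, purely illustrative sketch of the kind of numerical derivative computation referred to above: a first-order finite difference over a sampled time series. This is not the project’s code; the function name, parameters, and example series are hypothetical.

```python
# Illustrative sketch only: a first-order finite difference over a sampled series,
# i.e. the numerical-derivative pattern the TL;DR alludes to. If an attention head
# combines adjacent time steps with opposite signs, its output would resemble this.
import numpy as np

def finite_difference(series: np.ndarray, dt: float = 1.0) -> np.ndarray:
    """Approximate the derivative of a uniformly sampled series: (x[t+1] - x[t]) / dt."""
    return (series[1:] - series[:-1]) / dt

# Example: samples of x(t) = t^2 on a uniform grid; the estimates track dx/dt = 2t.
t = np.arange(0.0, 5.0, 0.5)
x = t ** 2
print(finite_difference(x, dt=0.5))
```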
Omohundro, in his paper, assumes goal-oriented AIs with well-defined utility functions, but contemporary models like GPT-3.5 and Claude largely exhibit behaviours without explicit goals. Advanced models such as GPT-o1 and Claude-3.5, however, have shown preliminary indications of goal-oriented behaviour in controlled environments.
Omohundro’s reliance on predictable rationality is undermined by emergent capabilities (for example, grokking, multi-step reasoning, and zero-shot learning), which arise unpredictably through phase transitions, often triggered by scaling up model parameters. These behaviours challenge the assumptions of linear rationality and gradual capability development. Research has also shown that AI systems can develop emergent behaviour when placed in competitive or adversarial scenarios. The paper also underestimates risks like latent misalignment, where AIs exploit...