I think we need to consider another avenue in which our emotions are generated, and affect our lives. An immediate, short- to medium-term high is, in a way, the least valuable personal return we can expect from our actions. However, there is a more subtle yet long-lasting emotional effect, which is more strongly correlated with our belief system and our rationality. I refer to a feeling of purpose we can have on a daily basis, a feeling of maximizing personal potential, and even long-term happiness. This is created when we believe we are doing the right thing, when we know there is still more to be done, and continue to make an effort. A good example of this is the difference between falling in love and being in love for a lifetime. Another example is raising children.

Every few months I sit in front of my computer and punch in a bunch of numbers, which results in a donation to GiveWell. The immediate emotional impact of this is about on par with eating a mediocre sandwich. However, every day I remind myself that that day's work contributes to my ability to make bigger and bigger donations. Also, every so often I am hit with the realization that I, insignificant little me, have saved people's lives, and can save more. That perhaps my existence on this planet will do more good than harm. The contribution of this to my overall emotional well-being cannot be overstated.

I think we can redefine caring along these lines. Then we will see that we do care, not only in action, but also in feeling. Any emotion that actually matters is not a momentary peak or trough.
“Models that are only pre-trained almost certainly don’t have consequentialist goals beyond the trivial next token prediction.”
Why is it impossible for our model, which is pre-trained on the whole internet, to pick up consequentialism and maximization, especially when it is already picking up non-consequentialist ethics and developing a “nuanced understanding” and “some understanding of direction following … without any reinforcement learning”? Why is it not possible to gain goal-directedness from pre-training on the whole internet, thereby learning it before the base goal is conceptualized/understood? For that matter, why can’t the model pick up goal-directedness and a proxy goal at this stage? To complicate matters further, couldn’t it pick up goal-directedness and a proxy goal without picking up consequentialism and maximization?