SociaLLM: proposal for a language model design for personalised apps, social science, and AI safety research

Roman Leventov

I first proposed this model here, as a base model for a proposed app to improve global online discourse through personalised comment ordering on all websites.

This post is also a response to the "Reverse-engineering prosociality" agenda described in the post "The 'Neglected Approaches' Approach: AE Studio's Alignment Agenda".

Architecture and training

SociaLLM is^[1] a foundation language model to be trained on chat, dialogue, and forum data where the identities of message authors (called "users" below) are stable and all messages have timestamps so we can have a global order of them.

The SociaLLM design builds upon the Mamba architecture, a language model that uses so-called state-space modelling (SSM) blocks instead of self-attention blocks. The model combines SSM blocks that track three separate message streams:
(1) the "local conversation"/flow of messages (which is exactly the training regime of the current LLMs);
(2) the message history of the particular user as well as their general "reading history", which in forum data could be approximated as the previous N (1-10) messages before each of the user's messages;
(3) the message history of the particular interlocutor of the user, which is the subset of the general "reading history" from the previous point, authored by a particular other user.
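The three streams above can be sketched as a simple split of one timestamped log. This is only an illustration under stated assumptions: the `Message` type, the `build_streams` function, and the `n_context` parameter are hypothetical names, and the "reading history" is approximated as the post suggests, by the N messages preceding each of the user's messages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    timestamp: int  # any globally ordered timestamp
    author: str     # stable user identity
    text: str


def build_streams(log, user, interlocutor, n_context=3):
    """Split one log into the three streams (hypothetical helper)."""
    log = sorted(log, key=lambda m: m.timestamp)
    # (1) local conversation: every message in global order
    local = list(log)
    # (2) the user's own messages plus an approximate "reading history":
    #     the n_context messages preceding each of the user's messages
    seen = set()
    for i, msg in enumerate(log):
        if msg.author == user:
            seen.update(range(max(0, i - n_context), i + 1))
    user_history = [log[i] for i in sorted(seen)]
    # (3) the part of that history authored by one particular interlocutor
    interlocutor_history = [m for m in user_history if m.author == interlocutor]
    return local, user_history, interlocutor_history


log = [
    Message(1, "alice", "hi"),
    Message(2, "bob", "hello"),
    Message(3, "carol", "hey"),
    Message(4, "alice", "how are you"),
    Message(5, "bob", "fine"),
]
local, user_history, interlocutor_history = build_streams(
    log, user="alice", interlocutor="bob", n_context=2
)
```

Here bob's last message is not in alice's streams (2) or (3), because alice has not written anything after it, so under this approximation she is not assumed to have read it.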

Training this model would cost from 2 times (on purely 1-1 dialogue data) to ~10-15 times (on chat room and forum data where messages from the most active users tend to be mixed very well) more than the training of the current LLMs. The data should be wrangled to create training sub-datasets from the perspective of each user pair, but otherwise, the training shouldn't be much fancier or more complicated than the current distributed training algorithms for LLMs (it seems to me).
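A rough sketch of what "from the perspective of each user pair" could mean when wrangling the data, assuming that every ordered (user, interlocutor) pair in a thread becomes one training sub-dataset. The function name `perspective_pairs` is hypothetical, and how pairs would be filtered in practice is not specified here.

```python
from itertools import permutations


def perspective_pairs(authors_in_thread):
    """All ordered (user, interlocutor) pairs among distinct authors."""
    users = sorted(set(authors_in_thread))
    return list(permutations(users, 2))


# A 1-1 dialogue is seen from two perspectives.
dialogue_pairs = perspective_pairs(["alice", "bob", "alice"])
# A thread with four authors has twelve ordered pairs before any filtering.
forum_pairs = perspective_pairs(["alice", "bob", "carol", "dave"])
```

Under this reading, a 1-1 dialogue is processed twice, which matches the lower end of the cost estimate. The unfiltered pair count grows quadratically with the number of authors, so some selection of pairs would be needed for busy threads.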

The first upside of this model is that we can create (what seems to be) strong inductive biases towards developing a large self-other overlap (see also this AI Safety Camp project by AE Studio):
(1) connecting the "user's own" SSM blocks and interlocutor's SSM blocks into the residual stream symmetrically (maybe just through parallel connection, as in multi-head attention);
(2) using the same weights for the user's own and interlocutor's SSM blocks^[2] (at inference time blocks are separate and track states separately, but their weights are the same and updated in lockstep batch after batch); and
(3) probably some extra regularisation techniques, such as intermittent "forgetting" of either the user's own or the interlocutor's state (which is not completely unlike some real-world situations for humans: sometimes people tell us that we met before but we don't remember them) and thus teaching the model to degrade gracefully under these circumstances.
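Points (2) and (3) can be illustrated with a toy example: one set of weights, two separately tracked states, and an occasional state reset. This is a minimal sketch, not Mamba: `ToySSMBlock` is a hypothetical diagonal linear recurrence with scalar inputs, and `maybe_forget` and the forgetting probability are assumptions made for illustration.

```python
import random


class ToySSMBlock:
    """Toy diagonal linear recurrence: h' = a * h + b * x (elementwise)."""

    def __init__(self, a, b):
        self.a = list(a)
        self.b = list(b)

    def init_state(self):
        return [0.0] * len(self.a)

    def step(self, state, x):
        return [a_i * h_i + b_i * x for a_i, b_i, h_i in zip(self.a, self.b, state)]


def maybe_forget(block, state, p_forget, rng):
    """With probability p_forget, reset the state to mimic 'forgetting'."""
    return block.init_state() if rng.random() < p_forget else state


# One block object means one set of weights for both roles.
shared = ToySSMBlock(a=[0.9, 0.5], b=[1.0, 2.0])
own_state = shared.init_state()    # tracks the user's own stream
other_state = shared.init_state()  # tracks the interlocutor's stream

rng = random.Random(0)
for own_x, other_x in [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]:
    own_state = shared.step(maybe_forget(shared, own_state, 0.1, rng), own_x)
    other_state = shared.step(maybe_forget(shared, other_state, 0.1, rng), other_x)
```

Because both states pass through the same block, any gradient update to `shared` would change how both the user and the interlocutor are represented, which is the lockstep update described in point (2).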

Industrial applications

As I already mentioned at the beginning of the post, I originally thought about this model as a base model that can be fine-tuned to predict whether the human user will find this or that information novel, insightful, boring, helpful, saddening, fun, and so on. This fine-tuned model, in turn, could be used within a browser extension to reorder comments on websites (YouTube, Reddit, Facebook, Twitter feed or replies, NYT, The Guardian, etc.) to order the "good" or "informationally valuable" comments first, which (I hope) should change the dynamics of the online echo chambers.

More generally, SociaLLM can improve almost all applications that currently use LLMs and for which personalisation matters at least as much as the raw reasoning and creative power: personalised content recommendations and filtering, customer service and engagement, education and language learning assistants, mental health and personal counselling (a-la Pi AI).

In the media and entertainment industries, SociaLLM could also be helpful in narrative analysis (for mass media products, such as movies and novellas) and interactive storytelling for the new forms of media and games.

There are also possible applications that enhance the collective intelligence of teams:

An add-on for a team chat platform (such as Slack) that spots the discrepancy of knowledge (or opinion) between team members as described in the paper "Collective Intelligence in Human-AI Teams: A Bayesian Theory of Mind Approach" (Westby & Riedl, 2023).
A conflict resolution app for teams, friend groups, and families.

Research and AI safety applications

The value of SociaLLM in social science research should be obvious: it could be directly used for research and experiments in language intentionality, Theory of Mind, social group or team dynamics, etc.

Beware: the discussion below is somewhat above my pay grade in terms of statistics and ML theory. Take it with a grain of salt, and if something in it looks wrong to you, please point it out.

Collective intelligence mechanisms and research (such as "Collective Intelligence in Human-AI Teams" mentioned above) often require a measure of the information content of the messages that agents send to each other. For SociaLLM to provide such a measure, the user's own and interlocutor's SSM blocks must use the same weights (as suggested above), so we can treat these SSM blocks as producing the same state representation structure.

Also, for such an informational measure, the SSM blocks should simultaneously provide the energy measure of the current state, i.e., the SSM blocks should simultaneously be Energy-Based Models (EBMs). I'm not sure how to engineer this into SSM blocks. Maybe the techniques from the "Recurrent Neural Filters" paper (Lim, Zohren, and Roberts, 2020) should help, where the Error Correction term aka auto-encoding (posterior) error can be used as the current state's energy. If you have other ideas on how to turn SSM models into (quasi-)energy-based models (or better yet, Bayesian models, but this seems a taller order), please share.
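One possible reading of "auto-encoding error as energy", sketched under strong assumptions: a hypothetical linear decoder with weights `c` predicts a scalar observation from the state, and the squared prediction error stands in for the energy. This is not the Recurrent Neural Filters method itself, and the function names `decode`, `energy`, and `information_gain` are made up for illustration.

```python
def decode(state, c):
    """Hypothetical linear decoder: predicted observation from the state."""
    return sum(c_i * h_i for c_i, h_i in zip(c, state))


def energy(state, observation, c):
    """Squared auto-encoding error, used here as a stand-in for energy."""
    err = observation - decode(state, c)
    return 0.5 * err * err


def information_gain(state_before, state_after, observation, c):
    """Drop in energy caused by a state update (an assumed proxy for 'news')."""
    return energy(state_before, observation, c) - energy(state_after, observation, c)


c = [1.0, 1.0]
gain = information_gain([0.0, 0.0], [1.0, 1.0], observation=2.0, c=c)
```

With shared weights, the same `energy` function could be applied to the user's own state and to the interlocutor's state, which is what would make the two numbers comparable. Whether such a quantity behaves like a proper information measure is an open question.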

On the AI safety front, SociaLLM could also be used to study (social) deception (e.g., when analysing Diplomacy game logs) and collusion, and, perhaps, help to engineer and test the mechanisms to disincentivise or prevent deception and collusion in AI teams aka agencies.

^[1] Note: this is a proposal, the model hasn't been trained (or even designed in detail) yet!
^[2] This feature of the architecture is also important for measuring the information content of the messages in collective intelligence mechanism design and collusion and deception detection, as explained in the section "Research and AI safety applications" below.

Clarity check: this model has not been trained yet at this time, correct? I understood you to be saying the model had already been trained, due to the word "is" in "is a foundation model", as well as "can improve" ("could improve"?), etc. Just wanted to double check and point out the ambiguity.

Seems like a potentially interesting idea though, an interesting sketch of things to try to combine. Doesn't seem like this would help scale to superintelligence, but it could at least be interesting as a more useful approach at near-human capability. It also seems like it would be quite a research project.

Clarity check: this model has not been trained yet at this time, correct?

Yes, I've changed the title of the post and added a footnote on "is a foundation model".

Announcement

I think SociaLLM has a good chance of getting OpenAI’s “Research into Agentic AI Systems” grant because it addresses two of its challenges. The first is the legibility of an AI agent’s behaviour: the agent’s behaviour becomes more “human-like” thanks to the weight sharing and regularisation techniques/inductive biases described in the post. The second is automatic monitoring: detection of duplicity or deception in an AI agent’s behaviour by comparing the agent’s ToMs “in the eyes” of different interlocutors, building on the work “Collective Intelligence in Human-AI Teams”.
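The monitoring idea could be sketched as follows, assuming each interlocutor's tracking blocks yield a state vector for the same agent and that a large divergence between those vectors is a signal worth inspecting. The use of cosine distance and the function names are assumptions for illustration only, not a tested detector.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def most_divergent_views(views):
    """views maps interlocutor -> state vector tracked for the same agent.

    Returns (distance, interlocutor_a, interlocutor_b) for the pair of
    interlocutors whose views of the agent differ the most.
    """
    return max(
        (cosine_distance(views[a], views[b]), a, b)
        for a, b in combinations(sorted(views), 2)
    )


views = {
    "alice": [1.0, 0.0],
    "bob": [1.0, 0.0],
    "carol": [0.0, 1.0],
}
distance, first, second = most_divergent_views(views)
```

In this toy input, carol's view of the agent is orthogonal to alice's and bob's, so the most divergent pair includes carol. A real monitor would still need a calibrated threshold, since people legitimately present themselves differently to different interlocutors.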

I am looking for co-investigators for this (up to $100k, up to 8 months long) project with hands-on academic or practical experience in DL training (preferably), ML, Bayesian statistics, or NLP. The deadline for the grant application itself is the 20th of January, so I need to find a co-investigator by the 15th of January.

Another requirement for the co-investigator is that they preferably should be in academia, non-profit, or independent at the moment.

I plan to be hands-on during the project in data preparation (cleansing, generation by other LLMs, etc.) and training, too. However, I don’t have any prior experience with DL training, so if I apply for the project alone, this is a significant risk and a likely rejection.

If the project is successful, it could later be extended for further grants or turned into a startup.

If the project is not a good fit for you but you know someone who may be interested, I’d appreciate it a lot if you shared this with them or within your academic network!

Please reach out to me in DMs or at leventov.ru@gmail.com.

In another thread, Marc Carauleanu wrote:

The main worry that I have with regards to your approach is how competitive SociaLLM would be with regards to SOTA foundation models given both (1) the different architecture you plan to use, and (2) practical constraints on collecting the requisite structured data. While it is certainly interesting that your architecture lends itself nicely to inducing self-other overlap, if it is not likely to be competitive at the frontier, then the methods uniquely designed to induce self-other overlap on SociaLLM are likely to not scale/transfer well to frontier models that do pose existential risks. (Proactively ensuring transferability is the reason we focus on an additional training objective and make minimal assumptions about the architecture in the self-other overlap agenda.)

I agree with worries (1) and (2). I think there is a way to de-risk this.

The block hierarchy that is responsible for tracking the local context consists of classic Transformer blocks. Only the user's own history tracking really needs to be an SSM hierarchy because it quickly surpasses the scalability limits of self-attention (the same goes for the interlocutor-tracking blocks in private 1-1 or small group chats, which can be arbitrarily long, but there is probably no such data available for training). In public data (such as forums, public chat rooms, Diplomacy and other text games), the interlocutor's history traces would 99% of the time easily fit into 100k symbols, but for symmetry with the user's own state (same weights!) and to have the same representation structure, the interlocutor's hierarchy should mirror the user's own SSM blocks, of course.

With such an approach, the SSM hierarchies could start very small, with only a few blocks or even just a single SSM block (i.e., two blocks in total: one for the user's own and one for the interlocutor's state), and attach to the middle of the Transformer hierarchy to select from it. However, I think this approach couldn't be just slapped onto a pre-trained Llama or another large Transformer LLM. I suspect the Transformer should be co-trained with the SSM blocks to induce the Transformer to make the corresponding representations useful for the SSM blocks. "Pretraining Language Models with Human Preferences" is my intuition pump here.

Regarding the sufficiency and quality of training data, the Transformer hierarchy itself could still be trained on arbitrary texts, just like the current LLMs. And we can adjust the size of the SSM hierarchies to the amounts of high-quality dialogue and forum data that we are able to obtain. I think it is a no-brainer that this design would improve the frontier quality in LLM apps that value personalisation and attunement to the user's current state (psychological, emotional, levels of knowledge, etc.), relative to whatever "base" Transformer model we would take (such as Llama, or any other).

One additional worry is that many of the research benefits of SociaLLM may not be out of reach for current foundation models, and so it is unclear if investing in the unique data and architecture setup is worth it in comparison to the counterfactual of just scaling up current methods.

With this I disagree: I think it's critical for the user state tracking to be energy-based. I don't think there are ways to recapitulate this with auto-regressive Transformer language models (cf. any of LeCun's presentations from the last year). There are potential ways to recapitulate this with other language modelling architectures (non-Transformer and non-SSM), but they currently don't hold any stronger promise than SSM, so I don't see any reasons to pick them.

That sounds ambitious and great, thanks for posting. What's the budget estimate for the fine-tuning part?

Training this model would cost from 2 times (on a purely 1-1 dialogue data) to ~10-15 times (on chat room and forum data where messages from the most active users tend to be mixed very well) more than the training of the current LLMs.

Llama 2 was pretrained like this:

Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB

As per “Llama 2: Open Foundation and Fine-Tuned Chat Models | Research - AI at Meta,” July 2023. https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/.

An A100 costs about $1 per hour, see https://vast.ai/pricing. So the baseline would be about $3.3M, and this model, at 2 to 15 times that, roughly $6.6M-$50M? This seems affordable for Google, Meta, etc., but for a grant of $100K max?
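The estimate can be reproduced as a short calculation. The GPU hours come from the Llama 2 quote above, the hourly price is the rough figure assumed in this comment, and the 2x and 15x multipliers are taken from the post's own cost claim. All of these are back-of-the-envelope inputs, not a budget.

```python
GPU_HOURS_BASELINE = 3.3e6  # Llama 2 pretraining, A100-80GB, as quoted above
USD_PER_GPU_HOUR = 1.0      # rough rental price assumed in this comment
MULTIPLIER_LOW = 2          # purely 1-1 dialogue data, per the post
MULTIPLIER_HIGH = 15        # upper end of "~10-15 times", per the post

baseline_usd = GPU_HOURS_BASELINE * USD_PER_GPU_HOUR
low_usd = baseline_usd * MULTIPLIER_LOW
high_usd = baseline_usd * MULTIPLIER_HIGH

print(f"baseline: ${baseline_usd / 1e6:.1f}M")
print(f"this model: ${low_usd / 1e6:.1f}M to ${high_usd / 1e6:.1f}M")
```

Even the baseline alone is more than thirty times the $100K grant ceiling, which is the point of the question.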

So perhaps update this project to fine-tune existing models. For classification only, some BERT-like model might do, such as DeBERTa or similar.