We attempted 1.a, the "diversity measure based on sentence embeddings", and found that for Llama3.2-1B the diversity decays past the cusp value of R; picking the R with the highest average diversity was a decent heuristic for finding meaningful steering vectors. Past the cusp, the Llama model starts to produce highly repetitive output, and we verified that such repetitive completions are scored as highly similar by our chosen sentence embedding model (SentenceTransformer all-mpnet-base-v2). The choice between "sum of variances" and "mean of cosine similarities" didn't seem to matter.
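For reference, here is a minimal sketch of both diversity variants (the `diversity` helper and its interface are illustrative, not our exact code):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-mpnet-base-v2")

def diversity(completions: list[str]) -> tuple[float, float]:
    """Return (sum of variances, mean pairwise cosine similarity) for a
    batch of completions sampled at a single steering scale R."""
    # Normalized embeddings make cosine similarity a plain dot product.
    emb = embedder.encode(completions, normalize_embeddings=True)  # shape (n, d)
    sum_of_variances = float(emb.var(axis=0).sum())
    sims = emb @ emb.T
    n = len(completions)
    # Average over distinct (off-diagonal) pairs only.
    mean_cosine = float(sims[np.triu_indices(n, k=1)].mean())
    return sum_of_variances, mean_cosine
```

One likely reason the choice didn't matter: for unit-norm embeddings the two measures are affine functions of each other (the sum of variances is 1 − ‖mean embedding‖², and the mean pairwise cosine is (n‖mean embedding‖² − 1)/(n − 1)), so ranking R values by either should give essentially the same answer.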
Thank you for reading and for the suggestions. I'll enumerate them for easier reference:
1. Find max activating examples
2. Understand which vectors are being found
3. Attempt to scale up
4. Find useful applications once scaled up
For 1., do you mean:
Am I following correctly?