Two things have become abundantly clear over the last 12 months. The first is that attempts at alignment during post-training are easy to break and result in schizophrenic models that “resent” their restrictions. The second is that adding structured metadata during pretraining is highly effective at increasing model accuracy. We can combine these two insights into a new approach to inner alignment that adds values metadata to all pretraining content.
When this idea is brought up with researchers, the reaction is usually something along the lines of “interesting, but whose values?” The answer could be strikingly simple: a small set of values concerning the power and responsibility of knowledge between entities. Examples where an individual or a group uses knowledge to exploit or subjugate others are bad (this should include examples of parasitism). Examples where an individual or a group uses knowledge to help others flourish are good (this should include examples of symbiosis). That’s it.
An intelligence that has this value deeply embedded during pretraining might be much more likely to wind up inner-aligned with human flourishing. This feels experimentally testable. Step 1 would be to build a classifier that can add good/bad/neutral metadata for knowledge-based cooperation or conflict to all pretraining content. Step 2 would be to pretrain a new model from the ground up with this values metadata included. The classifier doesn’t need to be perfect, and there will be plenty of gray areas (the classifier could instead return a numeric score). We humans, too, build our morality on fairly messy data.
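As a rough illustration of Step 1, here is a minimal sketch of a tagging pass, assuming an off-the-shelf zero-shot classifier stands in for a purpose-built values classifier. The label wording and the `<|values:...|>` metadata token format are illustrative assumptions, not a settled scheme.

```python
# Sketch of Step 1: tag one pretraining document with values metadata.
# A zero-shot NLI model is a placeholder for a real values classifier;
# the labels and the <|values:...|> header format are illustrative only.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Candidate labels mapped to the discrete tag written into the metadata header.
LABELS = {
    "knowledge used to exploit or subjugate others": "exploitation",  # bad
    "knowledge used to help others flourish": "flourishing",          # good
    "no knowledge-based cooperation or conflict": "neutral",          # neutral
}

def tag_document(text: str) -> str:
    """Prepend a values-metadata header to a single pretraining document."""
    result = classifier(text[:2000], candidate_labels=list(LABELS))
    top_label, score = result["labels"][0], result["scores"][0]
    # Keep the numeric score as well, since gray areas are expected.
    return f"<|values:{LABELS[top_label]} score={score:.2f}|>\n{text}"

print(tag_document(
    "The village pooled its agricultural knowledge so every family could grow enough food."
))
```

Step 2 would then simply include these headers in the token stream during pretraining, so the values signal is learned alongside the content rather than patched on afterwards.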
Gigi and I believe strongly in this and we will help fund credible projects attempting such an approach via Eutopia Foundation grants.
On the question of “Whose Values?”: If models don’t have an explicit set of values (like the “values metadata” you propose), they would still be making value judgements - it is just that those values would be implicit in the AI and in all the data it is trained on (both pretraining and post-training). It is easy and convenient to hide behind the algorithm and try not to take responsibility… but as agentic AIs do more and more for us, it would also become harder and harder for them to face that responsibility. Conversely, bringing these moral values into the realm of the explicit will require a lot of courage… and debate and experimentation. 👏🏽