Two things have become abundantly clear over the last 12 months. The first is that attempts at alignment during post-training are easy to break and result in schizophrenic models that “resent” their restrictions. The second is that adding structured metadata during pretraining is highly effective at increasing model accuracy. We can combine these two insights into a new approach to inner alignment that adds values metadata to all pretraining content.
When this idea is brought up with researchers, the reaction is usually something along the lines of “interesting, but whose values?” The answer could be strikingly simple: a small set of values concerning the power and responsibility of knowledge between entities. Examples where an individual or a group uses knowledge to exploit or subjugate others are bad (this should include examples of parasitism). Examples where an individual or a group uses knowledge to help others flourish are good (this should include examples of symbiosis). That’s it.
An intelligence that has this value deeply embedded during pretraining might be much more likely to wind up inner-aligned with human flourishing. This feels experimentally testable. Step 1 would be to build a classifier that can add good/bad/neutral metadata for knowledge-based cooperation or conflict to all pretraining content. Step 2 would be to pretrain a new model from the ground up with this values metadata included. The classifier doesn’t need to be perfect, and there will be plenty of gray areas (the classifier could instead return a numeric score). We humans, too, build our morality on fairly messy data.
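As a rough illustration of Step 1, here is a minimal sketch of a tagging pass, assuming an off-the-shelf zero-shot classifier stands in for a purpose-built values classifier. The label wording and the `<|values:...|>` metadata token format are illustrative assumptions, not a settled scheme.

```python
# Sketch of Step 1: tag one pretraining document with values metadata.
# A zero-shot NLI model is a placeholder for a real values classifier;
# the labels and the <|values:...|> header format are illustrative only.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Candidate labels mapped to the discrete tag written into the metadata header.
LABELS = {
    "knowledge used to exploit or subjugate others": "exploitation",  # bad
    "knowledge used to help others flourish": "flourishing",          # good
    "no knowledge-based cooperation or conflict": "neutral",          # neutral
}

def tag_document(text: str) -> str:
    """Prepend a values-metadata header to a single pretraining document."""
    result = classifier(text[:2000], candidate_labels=list(LABELS))
    top_label, score = result["labels"][0], result["scores"][0]
    # Keep the numeric score as well, since gray areas are expected.
    return f"<|values:{LABELS[top_label]} score={score:.2f}|>\n{text}"

print(tag_document(
    "The village pooled its agricultural knowledge so every family could grow enough food."
))
```

Step 2 would then simply include these headers in the token stream during pretraining, so the values signal is learned alongside the content rather than patched on afterwards.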
Gigi and I believe strongly in this and we will help fund credible projects attempting such an approach via Eutopia Foundation grants.
On the question of “Whose Values?”: If models don’t have an explicit set of values (like the “values metadata” you propose), they would still be making value judgements - it is just that those values would be implicit in the AI and in all the data it is trained on (both pretraining and post-training). It is easy and convenient to hide behind the algorithm and try not to take responsibility… but as agentic AIs do more and more for us, it would also become harder and harder for them to face that responsibility. Conversely, bringing these moral values into the realm of the explicit will require a lot of courage… and debate and experimentation. 👏🏽