My AI Alignment Research Agenda and Threat Model, right now (May 2023)

Fairly short timelines, mildly fast takeoffs, and medium-high uncertainties. -> Looking for abstractions to help cognition-steering and value-loading. -> Grasping at / reacting to related/scary/FOMO lines of research.

Threat Model

An AI system could be built that’s far smarter than any human or small group of humans. This AI system could use its intelligence to defeat any non-motivation-directing safeguards, and gain control of the world and the future of humanity and other sentient life. Based on the orthogonality thesis, this amount of power would probably not, by default, be directed towards the best interests of humanity and other sentient life. Based on the idea of instrumental convergence, such an AI would destroy everything we value in its quest to fulfill its (dumb-by-default) original goal. This AI may require new insights to build, or could arise by “scaling up” existing ML architectures. This AI may “self-improve” its architecture, or it could get smarter through prosaic “hack more cloud computing power” techniques. In either case, it could start from a position of low capabilities and end up as the most powerful entity on Earth. This all could start within as few as 1.5 years from now, and will probably happen within 10 years, barring nuclear or other catastrophe.

The simplest solution to the above problem would be “don’t build superhuman AGI, at least for the near future”. However, superhuman AGI is likely to be built, on purpose or by accident, by any of a handful of groups with large amounts of computational resources and talented researchers. These groups are generally not monolithic, and contain leaders and employees who disagree (internally, with other orgs, and/or with me) about the best approach to AI alignment. (See the section “The AI Landscape” below.)

Imagine if any of these groups got a box, today, that said “Input your alignment solution by USB drive, push button to get a superhuman AGI that runs on that, box expires in 1 week”. According to my threat model, humanity is unlikely to survive longer than 1 week in this scenario. This is despite the wildly varying (often quite good!) alignment-motivations and security-mindsets of these groups. On my view, this is (mainly) because none of these groups has an adequate pre-prepared response for the below “Two Subproblems”.

The Two Subproblems

I forgot where this advice came from, but I followed the tip of “Take a day or so to think through AI alignment, for yourself, from scratch”. I was definitely biased by my previous readings on AI (especially by Yudkowsky and Wentworth), but I basically came away with the following two subproblems:

Steering Cognition

How do we direct the thought-patterns, goals, and development of an AI system? This is basically the rocket alignment analogy, specifically the “Newtonian mechanics”/“basic physics” part.

For many, the core difficulty and most-important-part of AI alignment is to be able to steer a mind’s cognition at all. If we get this right, we set a lower-bound on the badness of AGI X-risks (while also opening the door to S-risks from solving this subproblem and neglecting the one below, but that’s not the immediate focus).

Determining/Loading Values

If we could aim a superintelligent AI system at anything, what should we aim it at, and how? In the “rocket alignment” analogy, this is basically the flight plan (or the method for creating the flight plan) to get to the moon.

At first, this seems to naturally decompose into “determine values” and “encode values into the AGI”. However, I consider this to be one subproblem, because an AGI could most likely carry out the execution of either or both of those steps. But before those steps is something like “figure out what [a pointer to [the best values for an AGI]] would look like, in enough detail to point an AGI at it and expect things to go well from there.” Due to fragility-of-value, I don’t expect a real-life satisfactory solution to AI alignment to involve a human (that is, a neither-augmented-nor-simulated human) writing down the full Sheet Of Human Values and then plugging it into an AGI. However, we could end up with, say, a reliable theory of / mathematical abstraction for our values, which an AGI could then “fill in the blanks” of through observation.

Theory of Change

If I research the above two subproblems (and/or the items in the section “What I’m Personally Learning/Researching” below), then one or both of the above subproblems will become more-solved. This could be a full end-to-end solution, a theoretical-but-proved plan, a paradigm that can be developed further, or contributions to the work of others. I am eager to help and fairly agnostic about how.

Furthermore, I think that even if my above “Threat Model” is wrong in one or more key ways, the research I want to do would still be helpful. For example, if AI takeoff speeds were slower, I would still want research-similar-to-mine to be developed and refined quickly. If neural networks were usurped by a new AI paradigm, I would still think research-similar-to-mine could help align the new architectures.

And, of course, if you or somebody you know seems well-equipped to carry out any of this research, please steal my ideas, move the work forward, and disclose the results responsibly.

What Success Could Look Like

What I’m Personally Learning/Researching



And again: If you or somebody you know seems well-equipped to carry out any of this research, please steal my ideas, move the work forward, and disclose the results responsibly.

My Current Constraints

These are described in more depth here. They are (in no particular order):

The most important constraint right now (i.e. the only real bottleneck at this time) is funding. With enough funding, I could work full-time on AI alignment, which would include solving or mitigating the other constraints.

Note that I already have a Bachelor’s degree in computer science, a minor in mathematics, and some other AI-related background (see here).

The AI Landscape

Here is how the rest of the AI alignment/safety landscape looks, to me, as of this writing:

Table 1: An Informal Assessment of Potentially-Strategically-Relevant AI and Alignment Organizations, as of late May 2023. (Note: This table may not be up-to-date.)

| Organization | One-Sentence Summary | Are they likely to cause AGI doom, including by accident? | Do they care about AGI risk? (This includes investor pressure and disagreements with my risk model!) |
| --- | --- | --- | --- |
| OpenAI | The most cutting-edge AGI research lab, structured as a sort of nonprofit/for-profit hybrid company, with heavy investment from Microsoft. | worrying | mostly |
| Microsoft | Tech giant, with large investments of money and cloud computing towards OpenAI. | worrying | maybe? |
| Google DeepMind | The other most cutting-edge AGI research lab, the AI arm of the search engine giant, and the original developers of the popular framework TensorFlow. | worrying | mostly |
| Google/Alphabet | Tech giant that owns DeepMind. | worrying | maybe? |
| X.AI | Elon Musk’s new AI research company. | worrying | Elon Musk |
| Meta/Facebook AI | The AI arm of the social networking giant, and the original developers of popular framework PyTorch. | worrying | no(?) |
| Conjecture | EleutherAI alumni + computational resources and funding + security mindset. | More than “kinda”, but less than “worrying”. | yes |
| MIRI (Machine Intelligence Research Institute) | The original AI alignment nonprofit, founded by Eliezer Yudkowsky, focused on formal research and fieldbuilding. | no | yes |
| Orthogonal | A new nonprofit built around an idiosyncratic approach, founded by an alumna of the Conjecture-hosted Refine alignment research incubator. | kinda | yes |
| ARC (Alignment Research Center) | The nonprofit alignment group run by Paul Christiano. | no | yes |
| Redwood Research | The nonprofit alignment group where Buck Shlegeris works. | kinda | yes |
| Ought | Product-oriented nonprofit working on factored cognition. | not much | maybe |
| Anthropic | Well-funded AI company focused on building and aligning large language models. | worrying | probably |
| The US government/military | The federal government and/or military of the United States of America. | not much, but could | probably |
| The Chinese government/military | The national government and/or military of the People’s Republic of China. | not much, but could | maybe |
| Any large technology company headquartered in China (including Baidu, Alibaba, Tencent, and others) | The most cutting-edge companies in China to have large computational resources. | not much, but could | maybe |
| Keen Technologies | AGI company with >$20M in funding, founded by John Carmack. | Never count out Carmack. | no(?) |
| Apart Research | Mostly fieldbuilding | no | probably |
| SERI MATS | Stanford/Berkeley-run program that supports independent alignment researchers like John Wentworth and Jeffrey Ladish. | no | yes |

