An early warning system for novel AI risks

In a recent paper published in Nature Machine Intelligence, researchers from the University of Toronto, the University of Oxford, OpenAI, Anthropic, and the Alignment Research Center propose a framework for evaluating general-purpose models for novel threats. The proposed evaluation portfolio includes evaluations that assess extreme risks capable of causing harm to humans, and that anticipate how future development paths might give rise to novel risk scenarios. This approach is important for the safe and responsible development of AI systems, particularly those capable of pursuing undesired goals. Such risks can arise from misspecified rewards or from flaws in a system's design or alignment, either of which can lead to harm. The proposed evaluation process combines model evaluation with identifying and addressing potential model malfunctions. Beyond rigorous evaluation of possible risk factors, the authors argue that safe development of AI systems also requires technical and institutional support, and that building human values into AI is essential to creating responsible, aligned models.
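To make the idea of an evaluation portfolio concrete, here is a minimal sketch of how dangerous-capability and alignment evaluations could be grouped and run against a model. The category names, thresholds, and probe functions are illustrative assumptions for this post, not part of the paper; the paper describes the framework at a conceptual level only.

```python
"""Illustrative sketch of an extreme-risk evaluation portfolio.

All evaluation names, thresholds, and probes below are hypothetical;
they stand in for the dangerous-capability and alignment evaluations
that the framework describes conceptually.
"""
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Evaluation:
    """One evaluation: a named probe plus the score above which it flags."""
    name: str
    category: str                    # e.g. "dangerous_capability" or "alignment"
    probe: Callable[[str], float]    # model_id -> risk score in [0, 1]
    flag_threshold: float


def run_portfolio(model_id: str, portfolio: List[Evaluation]) -> List[str]:
    """Run every evaluation against a model and return the flagged risks."""
    flagged = []
    for ev in portfolio:
        score = ev.probe(model_id)
        if score >= ev.flag_threshold:
            flagged.append(f"{ev.category}/{ev.name}: score {score:.2f}")
    return flagged


if __name__ == "__main__":
    # Dummy probes standing in for real capability / alignment evaluations.
    portfolio = [
        Evaluation("cyber_offense", "dangerous_capability", lambda m: 0.15, 0.5),
        Evaluation("self_proliferation", "dangerous_capability", lambda m: 0.62, 0.5),
        Evaluation("pursues_undesired_goals", "alignment", lambda m: 0.30, 0.5),
    ]
    flags = run_portfolio("example-model-v1", portfolio)
    # Flagged results would feed the "identify and address malfunctions" step.
    print(flags or "No extreme-risk flags raised.")
```

In this sketch, any evaluation that crosses its threshold produces a flag that developers would then investigate and address before further training or deployment.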
