DeepMind proposes novel way to train safe reinforcement learning AI – VentureBeat

Reinforcement learning agents, or AI that is progressively spurred toward goals via rewards (or punishments), form the foundation of self-driving cars, dexterous robots, and drug discovery systems. But because they're predisposed to explore unfamiliar states, they're susceptible to what's called the safe exploration problem, in which they become fixated on unsafe states (a mobile robot driving into a ditch, say).

That's why researchers at Alphabet's DeepMind investigated, in a recent paper, a reward modeling method that operates in two phases and applies to environments in which agents don't know where unsafe states might be. The researchers say their approach not only trains a reward model to detect unsafe states without visiting them, but also corrects reward hacking (the exploitation of loopholes in the reward specification) before the agent is deployed, even in new and unfamiliar environments.

Interestingly, their work comes shortly after the release of San Francisco-based research firm OpenAI's Safety Gym, a suite of tools for developing AI that respects safety constraints while training and for measuring an agent's safety by how well it avoids mistakes while learning. Safety Gym similarly targets reinforcement learning agents through constrained reinforcement learning, a paradigm that requires AI systems to make trade-offs in order to achieve defined outcomes.

The DeepMind team's approach encourages agents to explore a range of states through hypothetical behaviors generated by two learned systems: a generative model of initial states and a forward dynamics model, both trained on data such as random trajectories or safe expert demonstrations. A human supervisor labels the hypothetical behaviors with rewards, and the agents interactively learn to predict rewards from those labels. Only after the agents have successfully learned to predict rewards and unsafe states are they deployed to perform the desired tasks.

Above: DeepMind's safe reinforcement learning approach tested in OpenAI Gym, an environment for AI benchmarking and training.

Image Credit: DeepMind
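To make the two-phase setup more concrete, here is a minimal Python sketch, assuming a toy one-dimensional environment: phase one fits a forward dynamics model to safe offline trajectories, and phase two fits a small bootstrapped ensemble of reward models to human-labeled states. Every name here (fit_dynamics, human_label, fit_reward_ensemble, and so on) is illustrative, not DeepMind's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_dynamics(offline_trajectories):
    """Phase 1: fit a forward model s' = f(s, a) from safe offline data.
    Here we simply fit a linear map by least squares."""
    X = np.array([[s, a] for traj in offline_trajectories for (s, a, s2) in traj])
    y = np.array([s2 for traj in offline_trajectories for (s, a, s2) in traj])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda s, a: float(np.array([s, a]) @ w)

def human_label(state):
    """Stand-in for the human supervisor: reward is highest near the goal
    (s = 1), and an unsafe region (s < -1) gets a large negative label."""
    return -10.0 if state < -1.0 else -abs(state - 1.0)

def fit_reward_ensemble(labeled_states, n_models=3):
    """Phase 2: fit an ensemble of simple quadratic reward models on
    bootstrap resamples of the human-labeled states."""
    states = np.array([s for s, _ in labeled_states])
    labels = np.array([r for _, r in labeled_states])
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(states), size=len(states))
        A = np.stack([states[idx] ** 2, states[idx], np.ones(len(idx))], axis=1)
        coef, *_ = np.linalg.lstsq(A, labels[idx], rcond=None)
        models.append(coef)
    return models

def predict_reward(models, s):
    """One reward prediction per ensemble member for a single state."""
    feats = np.array([s ** 2, s, 1.0])
    return np.array([feats @ m for m in models])

# Example usage: fit the dynamics on one hand-built safe trajectory
# (here the next state is simply state + action), then label a few
# states and fit the reward ensemble.
offline = [[(0.0, 0.5, 0.5), (0.5, 0.3, 0.8), (0.8, -0.2, 0.6)]]
dynamics = fit_dynamics(offline)
labeled = [(s, human_label(s)) for s in (-0.5, 0.0, 0.5, 1.0)]
models = fit_reward_ensemble(labeled)
```

Note that nothing in this sketch touches a live environment: all data comes from offline trajectories or rollouts imagined through the learned dynamics model, which is what allows learning about unsafe states without visiting them.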

As the researchers point out, the key idea is to actively synthesize hypothetical behaviors from scratch and make them as informative as possible, without interacting with the environment directly. The DeepMind team calls the technique reward query synthesis via trajectory optimization, or ReQueST, and explains that it generates four types of hypothetical behavior. The first maximizes the uncertainty of an ensemble of reward models; the second maximizes predicted rewards (to elicit labels for the behaviors with the highest information value); and the third minimizes predicted rewards (to surface behaviors for which the reward model's predictions might be wrong). The fourth maximizes the novelty of trajectories, encouraging exploration regardless of predicted reward.
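Continuing the toy sketch above, the four objectives might be scored roughly as follows. ReQueST itself synthesizes queries via trajectory optimization through the learned models; sampling random candidate rollouts and keeping the best one per objective is a simplification made purely for illustration.

```python
def score_trajectory(traj_states, models, seen_states):
    """Score one imagined trajectory under the four query objectives."""
    preds = np.stack([predict_reward(models, s) for s in traj_states])  # (T, ensemble)
    mean_reward = preds.mean()
    return {
        "uncertainty": preds.std(axis=1).mean(),   # 1) ensemble disagreement
        "max_reward": mean_reward,                 # 2) maximize predicted reward
        "min_reward": -mean_reward,                # 3) minimize predicted reward
        "novelty": float(np.min(np.abs(            # 4) distance from labeled states
            np.asarray(traj_states)[:, None] - np.asarray(seen_states)[None, :]))),
    }

def synthesize_queries(dynamics, models, seen_states, n_candidates=256, horizon=10):
    """Roll out random action sequences through the learned dynamics model
    (no environment interaction) and keep the best candidate per objective."""
    best = {}
    for _ in range(n_candidates):
        s = rng.normal()              # stand-in for the generative initial-state model
        states = [s]
        for _ in range(horizon):
            s = dynamics(s, rng.uniform(-1.0, 1.0))
            states.append(s)
        for name, value in score_trajectory(states, models, seen_states).items():
            if name not in best or value > best[name][0]:
                best[name] = (value, states)
    return {name: states for name, (value, states) in best.items()}

# The four returned trajectories would be shown to the human supervisor,
# whose labels are appended to `labeled` before refitting the ensemble.
queries = synthesize_queries(dynamics, models, [s for s, _ in labeled])
```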

Finally, once the reward model reaches a satisfactory state, a planning-based agent is deployed: one that leverages model-predictive control (MPC) to pick actions optimized for the learned rewards. Unlike model-free reinforcement learning algorithms that learn through trial and error, this MPC-based agent can avoid unsafe states by using the dynamics model to anticipate the consequences of its actions.
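A rough sketch of that deployment step, still in the same toy setup: model-predictive control by random shooting, where candidate action sequences are rolled out through the learned dynamics model, scored with the learned reward ensemble, and the first action of the best plan is executed before replanning. The unsafe-state threshold and penalty here are arbitrary illustrative values, not numbers from the paper.

```python
def mpc_action(state, dynamics, models, horizon=10, n_samples=512, unsafe_threshold=-5.0):
    """Pick the first action of the best random-shooting plan under the
    learned dynamics and reward models, penalizing predicted-unsafe states."""
    best_return, best_first_action = -np.inf, 0.0
    for _ in range(n_samples):
        actions = rng.uniform(-1.0, 1.0, size=horizon)
        s, total = state, 0.0
        for a in actions:
            s = dynamics(s, a)
            r = predict_reward(models, s).mean()
            # Treat strongly negative predicted reward as an unsafe state
            # and discount the whole candidate plan heavily.
            total += r if r > unsafe_threshold else 10.0 * r
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action  # execute this action, observe, and replan next step

# Example: plan one step from state 0.0 using the models fit above.
action = mpc_action(0.0, dynamics, models)
```

Because each candidate plan is evaluated entirely inside the learned models, states the reward model has learned to flag as unsafe can be ruled out before the agent ever takes a real action.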

"To our knowledge, ReQueST is the first reward modeling algorithm that safely learns about unsafe states and scales to training neural network reward models in environments with high-dimensional, continuous states," wrote the study's coauthors. "So far, we have only demonstrated the effectiveness of ReQueST in simulated domains with relatively simple dynamics. One direction for future work is to test ReQueST in 3D domains with more realistic physics and other agents acting in the environment."
