Setting the steam injection rate without life-cycle optimization (e.g., holding a constant rate for a long period of time) can lead to sub-optimal performance of a thermal heavy-oil recovery process. Finding the optimal steam injection strategy (policy), however, is a major challenge due to the complex dynamics of the physical phenomenon: the process is nonlinear, slow, high-order, and time-varying, and reservoirs can be highly heterogeneous. To address this challenge, the problem can be formulated as an optimal control problem, which has typically been solved using adjoint-state optimization and a model-predictive control (MPC) strategy.

In contrast, this work presents a reinforcement learning (RL) approach in which the mathematical model of the dynamic process (steam-assisted gravity drainage, SAGD) is assumed unknown. An agent is trained to find the optimal policy purely through continuous interaction with the environment (here, a numerical reservoir simulation model). At each time step, the agent executes an action (e.g., increasing the steam injection rate), receives a reward (e.g., net present value), and observes the new state (e.g., pressure distribution) of the environment. Through this interaction, an action-value function is approximated; for a given state of the environment, this function yields the action that maximizes the expected total future reward. The process continues over multiple simulations (episodes) of the dynamic process until convergence is achieved.
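The interaction loop described above can be sketched as follows. This is a minimal, illustrative example only: `ToyReservoirEnv`, its dynamics, and the per-step reward are hypothetical stand-ins for the numerical reservoir simulator and the NPV calculation, which the abstract does not specify.

```python
import random

class ToyReservoirEnv:
    """Hypothetical stand-in for the reservoir simulator; dynamics are illustrative only."""
    def __init__(self, n_steps=10):
        self.n_steps = n_steps  # episode length (the paper uses a 250-day horizon)
        self.t, self.rate = 0, 1.0

    def reset(self):
        # Return the initial state: (time step, normalized steam injection rate)
        self.t, self.rate = 0, 1.0
        return (self.t, self.rate)

    def step(self, action):
        # Actions: 0 = decrease, 1 = hold, 2 = increase the injection rate
        self.rate = max(0.0, self.rate + (action - 1) * 0.1)
        self.t += 1
        # Toy per-step "NPV" reward: capped oil revenue minus steam cost (made up)
        reward = 2.0 * min(self.rate, 1.2) - 1.0 * self.rate
        done = self.t >= self.n_steps
        return (self.t, self.rate), reward, done

def run_episode(env, policy):
    """Run one episode: the agent acts, observes reward and next state, accumulates return."""
    state, total, done = env.reset(), 0.0, False
    while not done:
        action = policy(state)               # agent chooses an action
        state, reward, done = env.step(action)  # environment transitions, emits reward
        total += reward
    return total

random.seed(0)
env = ToyReservoirEnv()
total = run_episode(env, lambda s: random.choice([0, 1, 2]))  # random baseline policy
```

In the actual workflow, each call to `step` would correspond to advancing the reservoir simulation by one control interval.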

In this implementation, the state-action-reward-state-action (SARSA) on-policy learning algorithm is employed, in which the action-value function is re-estimated after every time step and used to choose the next action. The environment consists of a reservoir simulation model built using data from a reservoir located in northern Alberta. The model contains one well pair (one injector and one producer), and a production horizon of 250 days (one episode) is considered. The state of the environment is defined by the cumulative oil production, cumulative water production, and cumulative water injection; at each time step, three possible actions are considered: increase, decrease, or hold the current steam injection rate; and the reward is the net present value (NPV). Additionally, stochastic gradient descent is used to approximate the action-value function.
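A SARSA update with a linear action-value approximator trained by stochastic gradient descent can be sketched as below. The feature map, step sizes, epsilon-greedy exploration, and the compact toy environment are all assumptions for illustration; the abstract does not describe the actual approximator or exploration scheme.

```python
import random

class ToyEnv:
    """Minimal hypothetical stand-in for the reservoir simulator."""
    def reset(self):
        self.t, self.rate = 0, 1.0
        return (self.t, self.rate)

    def step(self, action):
        # Actions: 0 = decrease, 1 = hold, 2 = increase injection rate
        self.rate = max(0.0, self.rate + (action - 1) * 0.1)
        self.t += 1
        reward = 2.0 * min(self.rate, 1.2) - self.rate  # toy per-step NPV
        return (self.t, self.rate), reward, self.t >= 10

def features(state, action, n_actions=3):
    """Hypothetical features: a per-action block of [bias, scaled time, rate]."""
    base = [1.0, state[0] / 10.0, state[1]]
    vec = [0.0] * (len(base) * n_actions)
    vec[action * len(base):(action + 1) * len(base)] = base
    return vec

def q_value(w, state, action):
    return sum(wi * xi for wi, xi in zip(w, features(state, action)))

def epsilon_greedy(w, state, eps=0.1):
    if random.random() < eps:
        return random.randrange(3)
    return max(range(3), key=lambda a: q_value(w, state, a))

def sarsa_episode(env, w, alpha=0.05, gamma=0.95):
    """One on-policy episode: update weights after every time step (SARSA + SGD)."""
    state = env.reset()
    action = epsilon_greedy(w, state)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        next_action = epsilon_greedy(w, next_state)
        # TD target uses the action actually selected next (on-policy)
        target = reward + (0.0 if done else gamma * q_value(w, next_state, next_action))
        td_error = target - q_value(w, state, action)
        x = features(state, action)
        for i in range(len(w)):  # stochastic gradient step on the squared TD error
            w[i] += alpha * td_error * x[i]
        state, action = next_state, next_action
    return w

random.seed(0)
w = [0.0] * 9
for _ in range(50):  # repeated episodes until the estimate stabilizes
    w = sarsa_episode(ToyEnv(), w)
```

The key on-policy detail is that the temporal-difference target uses the action the behavior policy actually selects next, rather than the greedy maximum as in Q-learning.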

Results show that the optimal steam injection policy obtained using the RL implementation improves NPV by at least 30% at more than 60% lower computational cost.
