A reinforcement learning approach to airfoil shape optimization

In the following section, we present the learning capabilities of the DRL agent with respect to optimizing an airfoil shape, trained in our custom RL environment. Different objectives for the DRL agent were tested, grouped into three tasks. In Task 1, the environment is initialized with a symmetric NACA0012 airfoil and successive tests were performed in which the agent must (i) maximize the lift-to-drag ratio L/D, (ii) maximize the lift coefficient Cl, (iii) maximize endurance (C^{3/2}_{l}/C_{d}), and (iv) minimize the drag coefficient Cd. In Task 2, the environment is initialized with a high-performing airfoil having a high lift-to-drag ratio, and the agent must maximize this ratio. The goal is to test whether the learning process is sensitive to the initial state of the environment and whether higher-performing airfoils can potentially be produced by the agent. In Task 3, the environment is initialized with this same higher-performing airfoil, but flipped about the y axis. Under this scenario, we investigate the impact on the agent of initializing the environment with a poor-performing airfoil and determine whether the agent is able to modify the airfoil shape to recover a high lift-to-drag ratio. Overall, these tasks demonstrate the learning capabilities of the DRL agent to meet specified aerodynamic objectives.

Since we are interested in evaluating the drag of the agent-produced airfoils, the viscous mode of Xfoil is used. In viscous flow conditions, Xfoil only requires the user to specify a Reynolds number (Re) and an airfoil angle of attack (alpha). In all tasks, the flow conditions specified in Xfoil were kept constant. A zero-degree angle of attack and a Reynolds number equal to (10^{6}) were selected to define the design point for the flow conditions. The decision to keep the airfoil's angle of attack fixed is motivated by the interpretability of the agent's policy. A less constrained problem, in which the agent could also modify the angle of attack, would significantly increase the design space, making the agent's actions less interpretable. Additionally, the angle of attack is fixed at zero in order to easily compare the performance of agent-generated shapes with those found in the literature. The Reynolds number was chosen to represent an airfoil shape optimization problem at speeds below the transonic flow regime15. Hence, given the relatively low Re chosen, the flow over the airfoil is treated as incompressible, although Xfoil does include some compressibility corrections when approaching transonic regimes (Karman-Tsien compressibility correction43). All airfoils are thus compared at zero angle of attack.
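As an illustration of this evaluation step, the sketch below shows one way a viscous Xfoil run at (alpha = 0) and (Re = 10^{6}) could be scripted. It assumes the xfoil executable is available on the command line and that the coordinate file uses the common labeled (Selig-style) format; the wrapper and file names are illustrative, not the authors' tool chain.

```python
import subprocess

def run_xfoil_viscous(coords_file, alpha=0.0, reynolds=1e6, n_iter=100):
    """Illustrative wrapper: run Xfoil in viscous mode on an airfoil
    coordinate file and save the resulting polar to polar.txt."""
    # Standard Xfoil command sequence: load the coordinates, enter the OPER
    # menu, enable viscous mode at the given Reynolds number, set the
    # iteration limit, open a polar accumulation file and run one angle of attack.
    commands = "\n".join([
        f"LOAD {coords_file}",
        "OPER",
        f"VISC {int(reynolds)}",
        f"ITER {n_iter}",
        "PACC",
        "polar.txt",   # polar save file
        "",            # no dump file
        f"ALFA {alpha}",
        "",            # leave the OPER menu
        "QUIT",
    ]) + "\n"
    subprocess.run(["xfoil"], input=commands, text=True,
                   capture_output=True, timeout=60)
    # Cl and Cd at the requested angle of attack can then be parsed from polar.txt.
```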

Two parameters relating to the PPO algorithm in Stable Baselines can be set, namely the discount factor (gamma) and the learning rate. The discount factor determines how important future rewards are to the current state: (gamma = 0) favors short-term reward, whereas (gamma = 1) aims at maximizing the cumulative reward in the long run. The learning rate controls the amount of change brought to the model at each update: it is a hyperparameter of the PPO neural network. For the PPO agent, the learning rate must lie within ([5 times 10^{-6}, 0.003]). A study of the effects of the discount factor and learning rate on the learning process was conducted; it shows that optimal results are obtained with a discount factor (gamma = 0.99) and a learning rate of 0.00025.
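For concreteness, a minimal sketch of how these two hyperparameters would be passed to a Stable-Baselines PPO agent is given below, assuming the Stable-Baselines3 interface; AirfoilEnv is a stand-in for the custom environment described in this work, not the authors' actual class.

```python
from stable_baselines3 import PPO

# AirfoilEnv is a placeholder name for the custom airfoil environment (hypothetical).
env = AirfoilEnv()

model = PPO(
    "MlpPolicy",
    env,
    gamma=0.99,            # discount factor: favor long-term cumulative reward
    learning_rate=2.5e-4,  # within the admissible range [5e-6, 0.003]
)
```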

In building our custom environment, we have set some parameters to limit the generation of unrealistic shapes by the agent. These parameters take structural considerations into account and also limit the size of the action space. For instance, we define limits on the thickness of the produced shape. If the generated shape (resulting from the splines defined by the control points) exhibits a thickness above or below a specified limit, the agent receives a poor reward. Regarding the action space, we set bounds for the change in thickness and camber. This allows the agent to search in a restricted action space, thus eliminating a great number of unconverged shapes that would result from overly extreme modifications to the airfoil. These parameters are given in Table 2. Moreover, the iterations parameter is the number of times Xfoil is allowed to rerun a calculation for a given airfoil in the event the solver does not converge. A high iterations value increases the convergence rate of Xfoil but also increases run times.
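The snippet below sketches how such constraints could be expressed; the bound values, penalty and reward form are assumptions chosen for illustration and do not reproduce Table 2.

```python
import numpy as np
from gymnasium import spaces

# Hypothetical bounded action space: [x position of the control point,
# change in thickness, change in camber]; small increments avoid extreme shape changes.
ACTION_SPACE = spaces.Box(
    low=np.array([0.0, -0.01, -0.01], dtype=np.float32),
    high=np.array([1.0, 0.01, 0.01], dtype=np.float32),
)

THICKNESS_MIN, THICKNESS_MAX = 0.01, 0.40  # thickness limits (placeholder values)
PENALTY = -100.0                           # poor reward for unrealistic shapes

def shape_reward(max_thickness, l_over_d, prev_l_over_d, converged):
    """Illustrative reward: penalize out-of-limit thickness or an unconverged
    Xfoil run, otherwise reward the improvement in L/D over the previous shape."""
    if not converged or not (THICKNESS_MIN <= max_thickness <= THICKNESS_MAX):
        return PENALTY
    return l_over_d - prev_l_over_d
```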

The environment is initialized with a symmetric airfoil having (L/D = 0), (C_{l} = 0) and (C_{d} = 0.0054) at (alpha = 0) and (Re = 10^{6}). In a first experiment, the agent is tasked with producing the highest lift-to-drag airfoil, starting from the symmetric airfoil. During each experiment, the agent is trained over a total number of iterations (defined as the total timestep parameter), which are broken down into episodes having a given length (defined as the episode length parameter). The DRL agent is updated (i.e., changes are brought to the neural network parameters) every N steps. At the end of an experiment, several results are produced. Figure 7a displays the L/D of the airfoil successively modified by the agent at the end of each episode.
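The sketch below illustrates how these three settings (total timesteps, episode length and N steps) map onto a Stable-Baselines3 training run; the numerical values and the AirfoilEnv class are placeholders, not the settings reported above the figures.

```python
from gymnasium.wrappers import TimeLimit
from stable_baselines3 import PPO

# Episode length: each episode is truncated after a fixed number of shape
# modifications (placeholder value).
env = TimeLimit(AirfoilEnv(), max_episode_steps=20)

# n_steps controls how often the PPO network parameters are updated.
model = PPO("MlpPolicy", env, gamma=0.99, learning_rate=2.5e-4, n_steps=128)

# Total number of agent-environment interactions over the experiment.
model.learn(total_timesteps=20_000)
```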

Learning curves for max L/D objective starting with a symmetric airfoil.

In Fig. 7a, each dot represents the L/D value of the shape at the end of an episode and the blue line represents the running average L/D over 40 successive episodes. The maximum L/D obtained over all episodes is also displayed. Settings regarding the total number of iterations, episode length and N steps for the experiment are given above the graph. It can be observed from Fig. 7a that, starting with a low L/D during early episodes, the L/D at the end of an episode increases with the number of episodes. Although significant variance in the end-of-episode L/D can be seen, with values ranging between (L/D = -30) and (L/D = 138), the average value nevertheless increases and stabilizes around (L/D = 100). This increase in L/D suggests that the agent is able to learn the appropriate modifications to apply to the symmetric airfoil to obtain an airfoil with a high lift-to-drag ratio. We are also interested in tracking a score over a whole episode. Here, we arbitrarily define this score as the sum of the L/D of each shape produced during an episode. For instance, if an episode comprises 20 iterations, the agent has the opportunity to modify the shape 20 times, resulting in 20 L/D values; summing these values gives the score over one episode. If the agent produces a shape that does not converge in the aerodynamic solver, a value of 0 is added to the score, thus penalizing the episode score if the agent produces highly unrealistic shapes. The evolution of the score with the number of episodes played is displayed in Fig. 7b.
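A minimal sketch of this score computation follows, representing an unconverged shape by None; the function and variable names are illustrative.

```python
def episode_score(l_over_d_values):
    """Sum the L/D of every shape produced during an episode; shapes that did
    not converge in the aerodynamic solver contribute 0 to the score."""
    return sum(v if v is not None else 0.0 for v in l_over_d_values)

# Example: a short episode in which one shape failed to converge.
print(episode_score([10.0, 25.0, None, 40.0]))  # 75.0
```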

Figure 7b shows a significant increase in the average end-of-episode score, signaling that the agent is learning the optimal shape modifications. We can then visualize the best shape produced over the training phase in Fig. 8.

Agent-produced airfoil shape having highest L/D over training.

In Fig. 8, the red dots are the control points accessible to the agent. The blue curve describing the shape is the spline resulting from these control points. It is interesting to observe that the optimal shape produced shares the characteristics of high lift-to-drag ratio airfoils, such as those found on gliders, having high camber and a drooped trailing edge. Finally, we run the trained agent on the environment over one episode and observe the generated shapes in Fig. 9. Starting from the symmetric airfoil, we can notice the clear set of actions taken by the agent to modify the shape to increase L/D. The experiment detailed above was repeated by varying total timesteps, episode lengths and N steps.

Trained agent modifies shape to produce high L/D.

We then proceed to train the agent under different objectives: maximize Cl, maximize endurance and minimize Cd. Associated learning curves and modified shapes can be found in Figures 10, 11, 12 and 13.

Learning curves for max Cl objective starting with a symmetric airfoil.

Learning curves for max (C^{3/2}_{l}/C_{d}) objective starting with a symmetric airfoil.

For the Cd minimization objective, the environment is initialized with a symmetric airfoil having Cd = 0.0341. This change of initial airfoil, compared to the previously used NACA0012, was made to better visualize the learning process.

Learning curves for min Cd objective starting with a symmetric airfoil.

Trained agent modifies shape to produce low Cd starting with a low-performance airfoil.

Similarly, the results show a clear learning curve during which both the metric of interest and the end-of-episode score increase with the number of episodes. The learning process appears to happen within the first 100 episodes, as signaled by the rapid increase in the score, and then plateaus, oscillating around an average score value.

A second set of experiments was performed to assess the impact of the initial shape. The environment is initialized with a high-performing airfoil (i.e., having a relatively high lift-to-drag ratio) and the agent is tasked with bringing further improvement to this airfoil. We chose this airfoil by investigating the UIUC database41 and selecting the airfoil having the highest L/D. This corresponds to the Eppler 58 airfoil (e58-il), having (L/D = 160) at (alpha = 0) and (Re = 10^{6}), displayed in Fig. 14. Results for this experiment are displayed in Fig. 15.

Eppler 58 high lift-to-drag ratio airfoil.

Learning curves for max L/D objective starting with a high L/D airfoil.

It is interesting to compare the learning curves and average scores achieved when starting with the symmetric airfoil and with the high-performance airfoil.

In Fig. 16, we can observe that for both initial situations there is an increase in the average score during early episodes followed by stagnation, demonstrating the learning capabilities of the agent. However, the plateaued average score is significantly higher when the environment is initialized with the high-performance airfoil, since the environment then starts in an already high-reward region. Additionally, it was observed that a slightly higher maximum L/D value could be achieved when starting with the high lift-to-drag ratio airfoil. Overall, Task 1 and Task 2 emphasize the robustness of the RL agent, which successfully converges on high-L/D airfoils regardless of the initial shape (in both experiments, the agent converges on airfoils having (L/D > 160)). The agent-generated airfoil for Task 2 is represented in Fig. 21a.

Initial airfoil impact on the learning curve.

For Task 3, the starting airfoil is a version of the Eppler 58 airfoil that has been flipped about the y axis. As such, the starting airfoil has a lift-to-drag ratio opposite to that of the Eppler 58 (i.e., (L/D = -160)) and thus exhibits poor aerodynamic performance. The goal of this task is for the agent to modify the shape into a high-performing airfoil, having a high L/D.

In Fig. 17, we display the learning curves associated with the score and the L/D value at the end of each episode when the environment is initialized with the flipped e58 airfoil at the beginning of each episode. A noticeable increase in both the score and L/D values between episode 30 and episode 75 can be observed, followed by a plateau region. This demonstrates that the agent is able to learn the optimal policy to transform the poor-performing airfoil into a high-performing airfoil by making adequate changes to the airfoil shape. The agent then applies this learned optimal policy after episode 100. Moreover, the agent is capable of producing airfoils having lift-to-drag ratios equivalent to or higher than the Eppler e58 high-performance airfoil, signaling that the initial airfoil observed by the agent does not impact the optimal policy learned, but only delays its discovery (see Figs. 15 and 17).

Score and L/D learning curves when starting with a low performance airfoil.

An example of a high-L/D shape produced by the DRL agent when starting with the flipped e58 airfoil is displayed in Fig. 18. It is interesting to notice that in this situation the produced airfoil shares the previously observed geometric characteristics, such as high camber and a drooped trailing edge, leading to a high L/D value. The trained agent is then run over one episode length in Fig. 19. By successively modifying the airfoil shape, the agent is able to recover positive L/D values having started with a low-performance airfoil. This demonstrates the correctness of the behavior learned by the agent.

Agent-produced airfoil shape when starting with low performance airfoil.

Trained agent modifies shape to produce high L/D starting with a low-performance airfoil.

Finally, the best produced shapes (i.e., those maximizing the metric of interest) for the different objectives and tasks can now be compared, as illustrated in Figs. 20 and 21.

Best performing agent-produced shapes under different objectives and a symmetric initial airfoil.

Best performing agent-produced shapes under different objectives and an asymmetric initial airfoil.

The results presented above demonstrate that the number of function evaluations (i.e., the number of times Xfoil is run and converges on a new shape proposed by the agent) depends on the task at hand. For instance, around 2,000 function evaluations were needed in Task 2, while around 4,000 were needed in Task 1 and around 20,000 were required in Task 3. These differences can be explained by the distance between the starting shape and the optimal shape. In other words, when starting with the low-performing airfoil, the agent has to perform a greater number of successful steps to converge on an optimal shape, whereas when starting with an already high-performance airfoil, the agent is close to an optimal shape and requires fewer Xfoil evaluations to converge. The number of episodes needed to reach an optimal policy, however, appears to be between 100 and 200 across all tasks. Overall, when averaging across all tasks performed in this research, approximately 10,000 function evaluations were needed for the agent to converge on the optimal policy.

Having trained the RL agent on a given aerodynamic task, the designer can then draw physical insight by observing the actions the agent follows to optimize the airfoil shape. From the results presented in this research, it can be observed that high camber around the leading edge and low thickness around the trailing edge are preferred to maximize L/D, given the flow conditions used here. Observing the various policies corresponding to different aerodynamic tasks, the designer can then make tradeoffs between the different aerodynamic metrics to optimize. Multi-point optimization can be achieved by including multiple aerodynamic objectives in the reward. For example, if the designer seeks to optimize both L/D and Cl, a new definition of the reward could be (r = (L/D_{current} + Cl_{current}) - (L/D_{previous} + Cl_{previous})) (after having normalized L/D and Cl). However, multi-point optimization will decrease the interpretability of the agent's actions. By introducing multiple objectives in the agent's reward, it becomes more difficult for the designer to draw insight from shape changes and link those changes to the maximization of a specific aerodynamic objective.
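A sketch of such a combined reward is given below; the min-max normalization and its bounds are assumed choices for illustration, as the text only states that L/D and Cl are normalized.

```python
def normalize(value, lower, upper):
    """Min-max normalization to [0, 1]; the bounds are illustrative design choices."""
    return (value - lower) / (upper - lower)

def multi_objective_reward(ld_curr, cl_curr, ld_prev, cl_prev,
                           ld_bounds=(-200.0, 200.0), cl_bounds=(-2.0, 2.0)):
    """r = (L/D_current + Cl_current) - (L/D_previous + Cl_previous),
    computed on normalized quantities (bounds are placeholders)."""
    current = normalize(ld_curr, *ld_bounds) + normalize(cl_curr, *cl_bounds)
    previous = normalize(ld_prev, *ld_bounds) + normalize(cl_prev, *cl_bounds)
    return current - previous
```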

The proposed methodology reduces computational cost by leveraging a data-driven approach. Having learned an optimal policy for a given aerodynamic objective, the agent can be used to optimize new shapes without having to restart the whole optimization process. More specifically, this approach can be used to alleviate the computational burden of problems requiring high-fidelity solvers (e.g., when RANS modeling or compressibility effects are required). For these problems, the DRL agent can quickly find a first optimal solution using a low-fidelity solver. The solution can then be refined using a higher-fidelity solver and a traditional optimizer. In other words, DRL is used in this context to extract prior experience to speed up the high-fidelity optimization. As such, our approach can speed up the airfoil optimization process by very rapidly offering an initial optimal solution. Similarly to8, our approach can also be used directly with high-fidelity models. To accelerate convergence, the DRL agent is first trained using a low-fidelity solver in order to rapidly learn an optimal policy; the agent is then deployed using a high-fidelity solver. In doing so, this approach (i) reduces computational cost by shifting from a low- to a high-fidelity solver to speed up the learning process, (ii) is data-efficient, as the policy learned by the agent can then be followed for any comparable problem, and (iii) bears some generative capabilities, as it does not require any user-provided data.

As reinforcement learning does not rely on any provided database, no preconception of what a good airfoil shape should look like is available to the agent. This added design freedom leads the agent to occasionally generate airfoil shapes that can be viewed as unusual to the aerodynamicist's eye. In Fig. 22, we compare agent-produced shapes to existing airfoils from the literature. The focus is not on the agent's ability to produce a specific shape for given flow conditions and aerodynamic targets, but rather to illustrate the geometric similarities between existing airfoils and artificially generated shapes. A strong resemblance between the agent-generated and existing airfoils can be observed. This highlights the rationality of the policy learned by the agent: having no preexisting knowledge of fluid mechanics or airfoils, an intelligent agent trained in the presented custom RL environment can generate realistic airfoil shapes.

We compare five existing airfoils to our agent-produced shapes in Fig. 22. In Fig. 22a and b, we compare the agent-produced shape to Whitcomb's supercritical airfoil; the shared flat upper surface, cambered rear and blunt trailing edge can be noticed51. We then compare agent-generated shapes to existing high-lift airfoils. Here also, the geometric resemblance is noticeable, notably the shared high camber.

Airfoil shape comparison between agent-produced shapes and existing airfoils.

Detrimental effects of large episode lengths.

One observation concerns the drastic decrease in the average end-of-episode score after an initial period of increase. We believe this can be explained by the fact that, when the episode length is large, once the agent has learned a policy allowing it to quickly (within relatively few iterations) attain high L/D values, the average score then decreases because the agent reaches the optimal shape before the end of the episode. Within the remaining iterations before the episode ends, the agent continues to modify the shape hoping for higher performance, but reaches a limit where the shape is too extreme for the aerodynamic solver to converge, resulting in a poor reward. This would explain why we observe in Fig. 23 a rapid increase in the score between 0 and 25 episodes, during which the agent explores various shapes and estimates an optimal policy, followed by a strong decrease in the score, during which the agent follows the determined optimal policy and reaches optimal shapes before the episode ends.

The results presented above demonstrate the ability of a DRL agent to learn how to optimize airfoil shapes, provided a custom RL environment to interact with. We now compare this approach to a classical simplex method, under the same possible action conditions: starting from a symmetric airfoil, the optimizer must successively modify the shape by changing thickness and camber at selected x positions to achieve the highest performing airfoil in terms of L/D.

Here, the optimizer is based on the Nelder-Mead simplex algorithm, capable of finding the minimum of a multivariate function without having to calculate first or second derivatives52. In this case, the function maps a 3-set of actions, [select x position, change thickness, change camber], to a -L/D value. More specifically, taking the 3-set of actions as input, the function modifies the airfoil accordingly, evaluates the modified airfoil in Xfoil and outputs the associated -L/D. As the optimizer tries to minimize the -L/D value, it searches for the 3-set that maximizes L/D. Once the optimizer finds the optimal 3-set of actions, the airfoil shape is modified accordingly and the optimizer is rerun on this new modified shape. This defines what we call one optimization cycle. Hence, the optimizer is tasked with the exact same optimization problem as the DRL agent: optimizing the airfoil shape to reach the highest L/D value possible by successively modifying the shape. During each optimization cycle, the optimizer evaluates the function a certain number of times. In Fig. 24, we monitor the increase in L/D with the number of function evaluations.
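A sketch of one such optimization cycle, using SciPy's Nelder-Mead implementation, is shown below; apply_action and evaluate_l_over_d stand in for the shape-modification and Xfoil-evaluation steps and are not defined here.

```python
import numpy as np
from scipy.optimize import minimize

def negative_l_over_d(action, airfoil):
    """Objective: apply the action 3-set [x position, change in thickness,
    change in camber] to the current airfoil, evaluate it in Xfoil and
    return -L/D (apply_action and evaluate_l_over_d are hypothetical helpers)."""
    return -evaluate_l_over_d(apply_action(airfoil, action))

def optimization_cycle(airfoil, x0=np.zeros(3)):
    """One cycle: search for the 3-set minimizing -L/D, then apply it to the airfoil."""
    result = minimize(negative_l_over_d, x0, args=(airfoil,), method="Nelder-Mead")
    return apply_action(airfoil, result.x), -result.fun
```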

Simplex method approach: L/D increase with function evaluations for different starting points.

In the three situations displayed, it can be observed that the value of L/D increases with the number of function evaluations. However, the converged L/D value is significantly lower than the values obtained through the DRL approach. For instance, even after 500 optimization cycles (i.e., 500 shape modifications and over 30,000 function evaluations), the optimizer is unable to generate an airfoil having an L/D over 70. We know that this value of L/D is not a global optimum, as an L/D of at least 160 can be reached with the Eppler 58 airfoil from the UIUC database41. Thus, it seems that the simplex algorithm has converged on a local minimum. Furthermore, as demonstrated in Fig. 24a and c, the converged L/D value found by the optimizer is highly dependent on the initial point. The airfoil shapes generated using the simplex method can be found in Fig. 25.

Gradient-free approach generated airfoil shapes.

In Table 3, we compare the converged L/D values, number of iterations and run times of the simplex method and the DRL approach. In both approaches, the agent or optimizer can modify the airfoil 60 times. Although the number of iterations and the run time are lower for the simplex method, the converged L/D value is far lower than that obtained with the DRL approach.

This rapid simplex approach to the airfoil shape optimization problem highlights the benefits and capabilities of the presented DRL approach. First, the DRL approach seems less prone to convergence on local minima, as very high values of L/D can be achieved. Second, once the DRL agent has learned the optimal policy during a training period, it can be applied directly to any new situation whereas the simplex approach will require a whole optimization process for each new scenario encountered.
