Profile extrusion is a continuous production process for manufacturing plastic profiles from molten polymer. Of particular interest is the design of the die, through which the melt is pressed to attain the desired shape. However, due to an inhomogeneous velocity distribution at the die exit or residual stresses inside the extrudate, the final shape of the manufactured part often deviates from the desired one. To avoid these deviations, the shape of the die can be computationally optimized, which has already been investigated in the literature using classical optimization approaches [
A new approach in the field of shape optimization is the use of RL as a learning-based optimization algorithm. RL is based on trial-and-error interactions of an agent with an environment. For each action, the agent is rewarded and informed about the subsequent state of the environment. While not necessarily superior to classical optimization algorithms, e.g., gradient-based or evolutionary methods, for a single problem, RL techniques are expected to perform especially well when similar optimization tasks are repeated, since the agent learns a general strategy for generating optimal shapes instead of solving just one isolated problem.
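This trial-and-error loop can be sketched in a few lines. The toy environment below is purely illustrative: the one-parameter "shape", the hand-coded policy, and all names are our assumptions, not part of any RL library or of the framework discussed here.

```python
import random

class ToyShapeEnv:
    """Illustrative one-parameter stand-in for a shape-optimization
    environment: the 'shape' is a single value x, and the reward is
    highest when x reaches a target value."""

    def __init__(self, target=0.7):
        self.target = target
        self.x = 0.0

    def reset(self):
        self.x = random.uniform(0.0, 1.0)
        return self.x                              # initial observation

    def step(self, action):
        # action in {-1, +1}: incremental modification of the parameter
        self.x = min(1.0, max(0.0, self.x + 0.05 * action))
        reward = -abs(self.x - self.target)        # reward signal r_t
        done = abs(self.x - self.target) < 0.025   # episode termination
        return self.x, reward, done

def policy(obs, target=0.7):
    """Hand-coded stand-in for a learned policy."""
    return 1 if obs < target else -1

random.seed(0)
env = ToyShapeEnv()
obs = env.reset()
for t in range(100):          # agent-environment interaction loop
    action = policy(obs)      # action chosen from the current observation
    obs, reward, done = env.step(action)
    if done:
        break
```

A trained agent replaces the hand-coded policy; frameworks such as OpenAI Gym [2] and stable-baselines3 [30] standardize exactly this reset/step interface.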
In this work, we investigate this approach by applying it to two 2D test cases. The flow-channel geometry can be modified by the RL agent using so-called free-form deformation (FFD) [
Figure 1. Interaction loop between an agent and an environment during training. In each interaction/training step $ t $, the agent selects an action $ a_t $ according to a policy $ \pi $, based on observations of the environment's current state $ s_t $ and a numerical reward signal $ r_t $. This changes the state of the environment and generates a new observation and a new reward for the next step
Figure 3. Visualization of the RL interaction between an agent and the environment in our $ \texttt{releso} $ framework. The environment has been customized to our shape optimization problem. It comprises the base mesh and the deformation spline used for the geometry parameterization, an FFD module deforming the mesh, the solver computing the governing PDE problem, and a component for postprocessing the simulation results to determine the reward and an observation of the CFD environment for the agent. The arrows correspond to the information flows between the different components: green arrows represent meshes, yellow arrows spline parameterizations, and red arrows the simulation results. Based on the provided reward and observation, the agent chooses an action to modify the deformation spline of the FFD
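The postprocessing component that turns simulation results into a scalar reward can be illustrated with a small sketch. The function below is an assumption for illustration only: it penalizes imbalance of the outflow distribution across patches, which is in the spirit of the flow-homogeneity objective, but its name and exact definition are not the actual $\texttt{releso}$ implementation.

```python
def flow_balance_reward(patch_flow_rates):
    """Sketch of a reward for flow homogeneity: the negative mean relative
    deviation of the per-patch outflow rates from a uniform distribution.
    The best achievable reward is 0 (perfectly balanced flow)."""
    n = len(patch_flow_rates)
    target = sum(patch_flow_rates) / n            # uniform share per patch
    return -sum(abs(q - target) for q in patch_flow_rates) / (n * target)

print(flow_balance_reward([1.0, 1.0, 1.0]))   # balanced: reward 0.0 (prints -0.0)
print(flow_balance_reward([1.5, 1.0, 0.5]))   # unbalanced: negative reward
```

With such a definition, maximizing the reward drives the agent toward geometries with a uniform outflow distribution.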
Figure 6. (A) shows the deformation spline used for the parameterization of the T-shaped geometry. To make the FFD more generic, the actual geometry is scaled to the parametric space of the transformation spline before applying geometric modifications. Additionally, the possible movement of the control points is illustrated by the orange arrows in (B)
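The FFD principle described in the caption (scale the geometry into the parametric space of the spline, move control points, and re-evaluate the spline) can be sketched with a Bézier-type deformation; the actual spline type used in the framework may differ (e.g., B-splines or NURBS [25]).

```python
from math import comb

def bernstein(n, i, t):
    """Bernstein polynomial B_{n,i}(t)."""
    return comb(n, i) * t ** i * (1.0 - t) ** (n - i)

def ffd_2d(point, grid):
    """Bezier-type free-form deformation of a 2-D point that has already
    been scaled into the parametric space [0, 1]^2 of the spline.
    grid[i][j] holds the (x, y) position of control point (i, j)."""
    u, v = point
    n, m = len(grid) - 1, len(grid[0]) - 1
    x = y = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            w = bernstein(n, i, u) * bernstein(m, j, v)
            x += w * grid[i][j][0]
            y += w * grid[i][j][1]
    return (x, y)

# A uniform (undeformed) control grid reproduces the identity mapping ...
grid = [[(i / 2, j / 2) for j in range(3)] for i in range(3)]
print(ffd_2d((0.25, 0.5), grid))     # (0.25, 0.5)
# ... whereas moving a control point drags the embedded geometry with it.
grid[1][2] = (0.5, 0.8)
print(ffd_2d((0.25, 0.5), grid))
```

Because every embedded mesh node is evaluated through the same spline, moving a handful of control points deforms the whole geometry smoothly.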
Figure 9. Examples of optimal geometries obtained by a PPO agent following an incremental optimization strategy on the T-shaped geometry. One can see that the trained agent has learned a valid strategy which involves modifying the control points that strongly influence the cross-sectional area of the two outflows
Figure 17. Examples of optimal geometries obtained by a PPO agent following an incremental optimization strategy for the converging channel geometry. Each of the shown geometries was generated from a random initial geometry, created by randomly perturbing the control points. Qualitatively, the agent has learned to contract the channel as smoothly as possible to achieve optimal flow homogeneity
Table 1. Compatibility of the agents implemented in $ \texttt{releso} $ with the incremental and direct optimization strategies
| Agent | Incremental | Direct |
|-------|-------------|--------|
| PPO | $\checkmark$ | $\checkmark$ |
| DQN | $\checkmark$ | - |
| SAC | - | $\checkmark$ |
| A2C | $\checkmark$ | $\checkmark$ |
| DDPG | - | $\checkmark$ |
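The compatibility pattern in Table 1 follows from the action spaces the two strategies require: the incremental strategy applies a small modification per step, which fits discrete-action agents such as DQN, while the direct strategy assigns all control-point values in a single continuous action, as required by SAC and DDPG; PPO and A2C support both. A minimal sketch, with function names and step size chosen purely for illustration:

```python
def apply_incremental(control_points, action, step_size=0.05):
    """Incremental strategy: the discrete action picks one control point
    and a direction; the geometry changes by a small fixed increment."""
    index, direction = action                 # direction in {-1, +1}
    updated = list(control_points)
    updated[index] += direction * step_size
    return updated

def apply_direct(control_points, action):
    """Direct strategy: the continuous action assigns every control-point
    value at once, i.e., one action fully determines the shape."""
    assert len(action) == len(control_points)
    return list(action)

cps = [0.0, 0.0, 0.0]
cps = apply_incremental(cps, (1, +1))      # nudge control point 1 upwards
print(cps)                                 # [0.0, 0.05, 0.0]
cps = apply_direct(cps, [0.1, -0.2, 0.3])  # set the whole shape in one step
print(cps)                                 # [0.1, -0.2, 0.3]
```

Under the incremental strategy an episode consists of many such steps, whereas under the direct strategy a single action already yields a candidate geometry.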
Table 2. Material properties of the shear-thinning material law for all test cases
| Property | Symbol | Value | Unit |
|----------|--------|-------|------|
| zero-shear viscosity | $ A $ | 10935 | $\mathrm{kg\,m^{-1}\,s^{-1}}$ |
| reciprocal transition rate | $ B $ | 0.433 | $\mathrm{s^{-1}}$ |
| slope of viscosity curve in pseudoplastic region | $ C $ | 0.699 | - |
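The parameter names in Table 2 correspond to a Carreau-type shear-thinning law [4]. Assuming the common form $\eta(\dot{\gamma}) = A\,(1 + B\dot{\gamma})^{-C}$ (the exact formulation used in the simulations may differ), the tabulated values can be sanity-checked:

```python
def carreau_viscosity(shear_rate, A=10935.0, B=0.433, C=0.699):
    """Carreau-type viscosity eta = A * (1 + B * shear_rate)**(-C) with the
    parameter values of Table 2; this common form is an assumption, the
    paper's exact material law may differ."""
    return A * (1.0 + B * shear_rate) ** (-C)

print(carreau_viscosity(0.0))     # zero-shear limit equals A: 10935.0
print(carreau_viscosity(100.0) < carreau_viscosity(1.0))  # shear thinning: True
```

At vanishing shear rate the viscosity reduces to the zero-shear viscosity $A$, and it decreases monotonically with increasing shear rate, as expected for a pseudoplastic melt.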
Table 3. Wall-clock times of the agents trained with the direct optimization method to optimize the T-shaped geometry
| Agent | Max. training time |
|-------|--------------------|
| PPO | 26.1 h |
| A2C | 46.5 h |
| SAC | 26.0 h |
| DDPG | 33.5 h |
Table 4. Wall-clock times of the agents trained with the incremental optimization method for the T-shaped geometry use case
| Agent | Max. training time |
|-------|--------------------|
| PPO | 49.9 h |
| A2C | 45.1 h |
| DQN | 45.0 h |
Table 5. Wall-clock times of the agents trained with the direct optimization method for the converging channel use case
| Agent | Training time |
|-------|---------------|
| PPO | 64.2 h |
| A2C | 44.0 h |
| SAC | 70.0 h |
| DDPG | 45.0 h |
Table 6. Wall-clock times of the agents trained with the incremental optimization method for the converging channel use case
| Agent | Training time |
|-------|---------------|
| PPO | 42.2 h |
| A2C | 43.1 h |
| DQN | 44.7 h |
[1] | K. Arulkumaran, M. P. Deisenroth, M. Brundage and A. A. Bharath, Deep reinforcement learning: A brief survey, IEEE Signal Processing Magazine, 34 (2017), 26-38. doi: 10.1109/MSP.2017.2743240. |
[2] | G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang and W. Zaremba, OpenAI Gym, Computer Science, (2016), 1-4, http://arXiv.org/abs/1606.01540. |
[3] | L. Buşoniu, T. de Bruin, D. Tolić, J. Kober and I. Palunko, Reinforcement learning for control: Performance, stability, and deep approximators, Annual Reviews in Control, 46 (2018), 8-28. doi: 10.1016/j.arcontrol.2018.09.005. |
[4] | P. J. Carreau, Rheological equations from molecular network theories, Transactions of the Society of Rheology, 16 (1972), 99-127. doi: 10.1122/1.549276. |
[5] | J. A. Cottrell, T. J. R. Hughes and Y. Bazilevs, Isogeometric Analysis, John Wiley & Sons, Ltd, Chichester, UK, 2009. doi: 10.1002/9780470749081. |
[6] | F. Dworschak, S. Dietze, M. Wittmann, B. Schleich and S. Wartzack, Reinforcement learning for engineering design automation, Advanced Engineering Informatics, 52 (2022), 101612. doi: 10.1016/j.aei.2022.101612. |
[7] | S. Elgeti, M. Probst, C. Windeck, M. Behr, W. Michaeli and C. Hopmann, Numerical shape optimization as an approach to extrusion die design, Finite Elements in Analysis and Design, 61 (2012), 35-43. doi: 10.1016/j.finel.2012.06.008. |
[8] | P. Garnier, J. Viquerat, J. Rabault, A. Larcher, A. Kuhnle and E. Hachem, A review on deep reinforcement learning for fluid mechanics, Computers and Fluids, 225 (2021), 104973, 13 pp. doi: 10.1016/j.compfluid.2021.104973. |
[9] | H. Ghraieb, J. Viquerat, A. Larcher, P. Meliga and E. Hachem, Single-step deep reinforcement learning for two- and three-dimensional optimal shape design, AIP Advances, 12 (2022), 085108. doi: 10.1063/5.0097241. |
[10] | T. Haarnoja, A. Zhou, P. Abbeel and S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 35th International Conference on Machine Learning, ICML 2018, 5 (2018), 2976-2989. |
[11] | C. Hopmann and W. Michaeli, Extrusion Dies for Plastics and Rubber, 4th edition, Carl Hanser Verlag GmbH & Co. KG, München, 2016. doi: 10.3139/9781569906248. |
[12] | L. P. Kaelbling, M. L. Littman and A. W. Moore, Reinforcement learning: A survey, Journal of Artificial Intelligence Research, 4 (1996), 237-285, https://dl.acm.org/doi/10.5555/1622737.1622748, http://arXiv.org/abs/cs/9605103. doi: 10.1613/jair.301. |
[13] | J. Kober and J. Peters, Reinforcement Learning in Robotics: A Survey, Learning Motor Skills, (2014), 9-67. doi: 10.1007/978-3-319-03194-1_2. |
[14] | V. R. Konda and J. N. Tsitsiklis, Actor-critic algorithms, Advances in Neural Information Processing Systems, 1008-1014. |
[15] | A. Lampton, A. Niksch and J. Valasek, Morphing airfoils with four morphing parameters, AIAA Guidance, Navigation and Control Conference and Exhibit, (2012), 21 pp. doi: 10.2514/6.2008-7282. |
[16] | J. Lee, gustaf, https://github.com/tataratat/gustaf. |
[17] | R. Li, Y. Zhang and H. Chen, Learning the aerodynamic design of supercritical airfoils through deep reinforcement learning, AIAA Journal, 59 (2021), 3988-4001. doi: 10.2514/1.J060189. |
[18] | T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver and D. Wierstra, Continuous control with deep reinforcement learning, 4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings. |
[19] | W. Michaeli, S. Kaul and T. Wolff, Computer-aided optimization of extrusion dies, Journal of Polymer Engineering, 21. doi: 10.1515/POLYENG.2001.21.2-3.225. |
[20] | V. Mnih, A. P. Badia, L. Mirza, A. Graves, T. Harley, T. P. Lillicrap, D. Silver and K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, 33rd International Conference on Machine Learning, ICML 2016, 4 (2016), 2850-2869. |
[21] | V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra and M. Riedmiller, Playing atari with deep reinforcement learning, Computer Science, (2013), 1-9, http://arXiv.org/abs/1312.5602. |
[22] | J. M. Nóbrega, O. S. Carneiro, F. T. Pinho and P. J. Oliveira, Flow balancing in extrusion dies for thermoplastic profiles, Part III: Experimental assessment, International Polymer Processing, 19 (2004), 225-235. |
[23] | T. Osswald and N. Rudolph, Polymer Rheology, Carl Hanser Verlag GmbH & Co. KG, 2015. |
[24] | L. Pauli, M. Behr and S. Elgeti, Towards shape optimization of profile extrusion dies with respect to homogeneous die swell, Journal of Non-Newtonian Fluid Mechanics, 200 (2013), 79-87. doi: 10.1016/j.jnnfm.2012.12.002. |
[25] | L. Piegl and W. Tiller, The NURBS Book, Monographs in Visual Communications, Springer Berlin Heidelberg, Berlin, Heidelberg, 1995. doi: 10.1007/978-3-642-97385-7. |
[26] | J. F. Pittman, Computer-aided design and optimization of profile extrusion dies for thermoplastics and rubber: A review, Proceedings of the Institution of Mechanical Engineers, Part E: Journal of Process Mechanical Engineering, 225 (2011), 280-321. doi: 10.1177/0954408911415324. |
[27] | S. Qin, S. Wang, L. Wang, C. Wang, G. Sun and Y. Zhong, Multi-objective optimization of cascade blade profile based on reinforcement learning, Applied Sciences (Switzerland), 11 (2021), 1-27. doi: 10.3934/ipi.2021045. |
[28] | J. Rabault and A. Kuhnle, Accelerating deep reinforcement learning strategies of flow control through a multi-environment approach, Physics of Fluids, 31 (2019), 094105. doi: 10.1063/1.5116415. |
[29] | J. Rabault, F. Ren, W. Zhang, H. Tang and H. Xu, Deep reinforcement learning in fluid mechanics: A promising method for both active flow control and shape optimization, Journal of Hydrodynamics, 32 (2020), 234-246. doi: 10.1007/s42241-020-0028-y. |
[30] | A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus and N. Dormann, Stable-baselines3: Reliable reinforcement learning implementations, Journal of Machine Learning Research, 22 (2021), 1-8. |
[31] | A. Rajkumar, L. L. Ferrás, C. Fernandes, O. S. Carneiro, M. Becker and J. M. Nóbrega, Design guidelines to balance the flow distribution in complex profile extrusion dies, International Polymer Processing, 32 (2017), 58-71. doi: 10.3139/217.3272. |
[32] | A. Rajkumar, L. L. Ferrás, C. Fernandes, O. S. Carneiro and J. M. Nóbrega, Guidelines for balancing the flow in extrusion dies: The influence of the material rheology, Journal of Polymer Engineering, 38 (2018), 197-211. doi: 10.1515/polyeng-2016-0449. |
[33] | J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov, Proximal policy optimization algorithms, CoRR, (2017), 1-12, http://arXiv.org/abs/1707.06347. |
[34] | T. W. Sederberg and S. R. Parry, Free-form deformation of solid geometric models, Proceedings of the 13th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1986, 20 (1986), 151-160. doi: 10.1145/15922.15903. |
[35] | R. Siegbert, J. Kitschke, H. Djelassi, M. Behr and S. Elgeti, Comparing optimization algorithms for shape optimization of extrusion dies, PAMM, 14 (2014), 789-794. doi: 10.1002/pamm.201410377. |
[36] | D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan and D. Hassabis, A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, Science, 362 (2018), 1140-1144. doi: 10.1126/science.aar6404. |
[37] | R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd edition, The MIT Press, 2018, http://incompleteideas.net/book/the-book-2nd.html. |
[38] | I. Szarvasy, J. Sienz, J. F. T. Pittman and E. Hinton, Computer aided optimisation of profile extrusion dies, International Polymer Processing, 15 (2000), 28-39. doi: 10.3139/217.1577. |
[39] | T. E. Tezduyar, J. Liou and M. Behr, A new strategy for finite element computations involving moving boundaries and interfaces-the DSD/ST procedure. I. The concept and the preliminary numerical tests, Computer Methods in Applied Mechanics and Engineering, 94 (1992), 339-351. doi: 10.1016/0045-7825(92)90059-S. |
[40] | O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps and D. Silver, Grandmaster level in StarCraft II using multi-agent reinforcement learning, Nature, 575 (2019), 350-354. doi: 10.1038/s41586-019-1724-z. |
[41] | J. Viquerat, R. Duvigneau, P. Meliga, A. Kuhnle and E. Hachem, Policy-based optimization: Single-step policy gradient method seen as an evolution strategy, Neural Computing and Applications, 35 (2023), 449–467, http://arXiv.org/abs/2104.06175. doi: 10.1007/s00521-022-07779-0. |
[42] | J. Viquerat, P. Meliga and E. Hachem, A review on deep reinforcement learning for fluid mechanics: An update, Computational Physics, (2022), http://arXiv.org/abs/2107.12206. |
[43] | J. Viquerat, J. Rabault, A. Kuhnle, H. Ghraieb, A. Larcher and E. Hachem, Direct shape optimization through deep reinforcement learning, Journal of Computational Physics, 428 (2021), 110080, 12 pp. doi: 10.1016/j.jcp.2020.110080. |
[44] | D. Wolff, C. D. Fricke, M. Kemmerling and S. Elgeti, [WIP] Towards shape optimization of flow channels in profile extrusion dies using reinforcement learning, Proceedings in Applied Mathematics and Mechanics, 22. |
[45] | X. Yan, J. Zhu, M. Kuang and X. Wang, Aerodynamic shape optimization using a novel optimizer based on machine learning techniques, Aerospace Science and Technology, 86 (2019), 826-835. doi: 10.1016/j.ast.2019.02.003. |
[46] | O. Yilmaz, H. Gunes and K. Kirkkopru, Optimization of a profile extrusion die for flow balance, Fibers and Polymers, 15 (2014), 753-761. doi: 10.1007/s12221-014-0753-3. |
[47] | G. Zhang, X. Huang, S. Li and T. Deng, Optimized design method for profile extrusion die based on NURBS modeling, Fibers and Polymers, 20 (2019), 1733-1741. doi: 10.1007/s12221-019-1168-y. |
Figure 2. Limited taxonomy of RL agents. Blue items represent categories of agents. Turquoise items are agents trained with the off-policy method. Magenta items are agents trained with the on-policy method
Figure 4. Visualization of the key idea of FFD: An initial geometry (A) is transformed by modifying the control points (dark-blue dots) of a transformation spline (highlighted in light-blue) to obtain a deformed geometry (B)
Figure 5. Geometry and boundaries of the T-shaped geometry
Figure 7. Comparison of different algorithms trained to optimize the T-shaped geometry following a direct strategy with respect to the episode reward over the trained steps. Each run was repeated twice as indicated using the same color
Figure 8. Comparison of different algorithms trained to optimize the T-shaped geometry following an incremental strategy with respect to the episode reward over the trained steps. Each run was repeated twice as indicated using the same color
Figure 10. Episode reward over training steps of a PPO agent following an incremental strategy for optimizing the T-shaped geometry when interacting with different numbers of environments
Figure 11. Episode reward over wall time of a PPO agent following an incremental strategy for optimizing the T-shaped geometry when interacting with different numbers of environments
Figure 12. Geometry and boundaries of the converging channel geometry
Figure 13. Deformation spline used for the parameterization of the channel geometry. Additionally, the possible movement of the control points is illustrated by the orange arrows
Figure 14. Outflow boundary of the converging channel geometry. The outflow is divided into three patches
Figure 15. Comparison of different agents trained to optimize the converging channel geometry following a direct strategy with respect to the episode reward over the trained steps. Each run was repeated twice as indicated using the same color
Figure 16. Comparison of different agents trained to optimize the converging channel geometry following an incremental strategy with respect to the episode reward over the trained steps. Each run was repeated twice as indicated using the same color
Figure 18. Comparison of different agents trained to optimize the converging channel geometry following a direct strategy with respect to the steps per episode over the trained steps. Each run was repeated twice as indicated using the same color
Figure 19. Comparison of different agents trained to optimize the converging channel geometry following an incremental strategy with respect to the steps per episode over the trained steps. Each run was repeated twice as indicated using the same color
Figure 20. Episode reward over training steps of a PPO agent following an incremental strategy for optimizing the converging channel geometry when interacting with different numbers of environments
Figure 21. Episode reward over wall time of a PPO agent following an incremental strategy for optimizing the converging channel geometry when interacting with different numbers of environments