Evaluating the Foundations and Challenges of Deep Reinforcement Learning for Continuous Control: A Critical Review and Conceptual Synthesis
Keywords:
Deep Reinforcement Learning, Markov Decision Processes, Continuous Control, Dynamic Programming
Abstract
Reinforcement learning (RL) has emerged as a powerful framework for sequential decision-making under uncertainty. With the advent of deep learning, Deep Reinforcement Learning (DRL) methods have enabled agents to achieve human-level or even superhuman performance in domains ranging from games to continuous control tasks. Despite these impressive empirical successes, fundamental theoretical and methodological challenges remain, especially when DRL is applied to continuous-control domains historically addressed by classical methods such as Dynamic Programming and the Markov Decision Process (MDP) framework. This article critically reviews the foundations of RL, in particular MDP-based formulations and value-based dynamic programming, contrasts them with the empirical DRL paradigm exemplified by continuous-control benchmarks, and identifies key limitations, research gaps, and future directions. Building on classical theory (Bellman, 1957; Puterman, 1994; Sutton & Barto, 2018) and modern empirical work (Duan et al., 2016; Mnih et al., 2015), we offer a conceptual synthesis that highlights the tension between theoretical guarantees and practical performance, as well as persistent concerns about sample complexity, data efficiency, stability, and reproducibility. We argue for a renewed research focus on bridging theory and practice, emphasising reproducible benchmarks, rigorous evaluation, and extensions of classical stochastic dynamic programming to high-dimensional, non-linear environments. Our discussion outlines a research agenda to strengthen the foundations of DRL for continuous control.
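To make the classical side of this contrast concrete, the sketch below runs tabular value iteration, the Bellman-style dynamic programming procedure reviewed in the article, on a small randomly generated MDP. The transition model P, reward table R, discount factor, iteration budget, and convergence tolerance are illustrative assumptions for this sketch, not quantities taken from the article or from the cited benchmarks.

# Minimal value-iteration sketch on a toy tabular MDP (illustrative only).
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.95

# Hypothetical model: P[s, a, s'] are transition probabilities, R[s, a] rewards.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.standard_normal((n_states, n_actions))

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    Q = R + gamma * (P @ V)          # shape (n_states, n_actions)
    V_new = Q.max(axis=1)            # greedy improvement over actions
    if np.max(np.abs(V_new - V)) < 1e-8:   # stop at an assumed tolerance
        break
    V = V_new

greedy_policy = Q.argmax(axis=1)
print("Approximate optimal state values:", V)
print("Greedy policy:", greedy_policy)

This tabular backup is exactly what becomes intractable in the high-dimensional, continuous state and action spaces targeted by the DRL benchmarks discussed in the article, which is the tension the synthesis examines.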
References
Bellman, R. (1957). Dynamic Programming. Princeton University Press.
Duan, Y., Chen, X., Houthooft, R., Schulman, J., & Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016). PMLR.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., … & Petersen, S. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
O’Reilly Media. (n.d.). Reinforcement Learning Explained. O’Reilly Radar.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Towards Data Science. (n.d.). Markov Decision Processes and Bellman Equations. Towards Data Science blog.
Vuppala, S. P., & Malviya, S. (2025). Towards self-learning data pipelines: Reinforcement learning for adaptive ETL optimization. International Journal of Applied Mathematics, 38(8s), 108–121.
Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.