Automated Reinforcement Learning: An Overview
Reza Refaei Afshar, Joaquin Vanschoren, Uzay Kaymak, Rui Zhang, Yaoxin Wu, Wen Song, and Yingqian Zhang
Reza Refaei Afshar, Joaquin Vanschoren, Yaoxin Wu and Yingqian Zhang are with Eindhoven University of Technology, 5600 MB Eindhoven, the Netherlands (e-mails: r.refaei.afshar@tue.nl, j.vanschoren@tue.nl, y.wu2@tue.nl, yqzhang@tue.nl). Uzay Kaymak is with Jheronimus Academy of Data Science (JADS), 's-Hertogenbosch, the Netherlands (e-mail: u.kaymak@ieee.org). Rui Zhang and Wen Song are with Shandong University, 266237 Qingdao, China (e-mails: 202300190245@mail.sdu.edu.cn, wensong@email.sdu.edu.cn).
Abstract
Reinforcement Learning (RL) and, more recently, Deep Reinforcement Learning (DRL) are popular methods for solving sequential decision-making problems modeled as Markov Decision Processes (MDPs). Modeling a problem in RL and selecting algorithms and hyper-parameters require careful consideration, as different configurations may lead to completely different performance. These considerations are mainly the task of RL experts; however, RL is progressively becoming popular in other fields, such as combinatorial optimization, where researchers and system designers are not necessarily RL experts. Moreover, many modeling decisions, such as defining the state and action space, the batch size, the batch update frequency, and the number of timesteps, are typically made manually. For these reasons, automating the different components of RL is of great importance, and it has attracted much attention in recent years. Automated RL (AutoRL) provides a framework in which the different components of RL, including MDP modeling, algorithm selection, and hyper-parameter optimization, are modeled and defined automatically. In this article, we present the literature on AutoRL, including recent large language model (LLM) based techniques. We also discuss recent work on techniques that are not presently tailored to AutoRL but hold promise for future integration into it. Furthermore, we discuss the challenges, open questions, and research directions in AutoRL.
Impact Statement
Automated Reinforcement Learning (AutoRL) aims to make reinforcement learning accessible to non-experts by automating complex processes such as MDP modeling, algorithm selection, and hyper-parameter optimization. This paper provides a comprehensive overview of AutoRL, exploring its potential to significantly impact fields like robotics, optimization, and control systems by reducing the need for extensive RL expertise. The advancements discussed in the article will enable broader adoption and application of RL methods, fostering innovation and efficiency in various domains of artificial intelligence and complex decision-making.
Keywords: Reinforcement Learning, AutoRL, Pipeline
1 Introduction
Reinforcement Learning (RL) is a learning approach in which no prior knowledge of the environment is necessary. Here, the environment is everything surrounding an agent: it provides states and rewards and receives the agent's actions. An agent learns the optimal behavior, known as a policy, by interacting with the environment. At each decision step, the agent observes the state of the environment, takes an action, and receives a scalar reward value from the environment. Using this reward value, the agent adjusts its policy to maximize the long-term reward. The long-term reward is either the sum of all future rewards or the discounted sum of future rewards, where discounting reduces the impact of rewards far in the future [119].
RL is a method to solve sequential decision-making problems modeled as a Markov Decision Process (MDP). MDPs can be continuous-time, infinite-horizon, partially observable, or a combination of these properties. Infinite-horizon MDPs do not have a goal state, and the system runs forever [119]; the agent learns to maximize its total reward without expecting a goal state. In continuous-time MDPs, unlike discrete-time ones, decisions are made at every point in time. The formulation of a continuous-time MDP is similar to a discrete one, but these problems are generally more challenging to solve [47]. A Partially Observable MDP (POMDP) is a kind of MDP where the environment is only partially observable to the agent, for reasons such as limited sensors or uncertainty in the environment, which limits the available information [101, 119]. These variants of MDPs can also be modeled and solved by RL, although they need different considerations to define the states; in a POMDP, the agent maintains a belief about the environment rather than a complete observation. In this article, we focus on discrete-time fully observable MDPs.
To model a problem in the RL framework and solve it accordingly, different components of RL, including the MDP modeling, the algorithm, and hyper-parameters such as the number of training steps and the structure of the policy function, should be determined before starting the learning procedure. In practice, deciding on these components is rarely straightforward. It often requires considerable experience, repeated experimentation, and careful tuning through trial and error. Even small changes in hyper-parameters or network architecture can significantly affect performance, stability, and convergence. As a result, developing an effective RL solution can be both time-consuming and highly sensitive to design choices. This is where Automated Reinforcement Learning (AutoRL) becomes particularly valuable. Instead of relying on manual tuning and expert intuition, AutoRL seeks to automate the selection and optimization of key elements such as algorithms, network structures, reward formulations, and training hyper-parameters. By systematically searching the configuration space, AutoRL can identify combinations that might not be obvious through manual experimentation. The practical importance of AutoRL lies in its ability to reduce development effort while improving robustness and reproducibility. Automated configuration helps mitigate the risk of unstable training caused by poorly chosen parameters and makes RL methods more adaptable to new tasks and environments. This is especially important in real-world applications, such as robotics, autonomous systems, and complex optimization problems, where training instability or inefficient exploration can be costly. Moreover, by lowering the dependency on deep RL expertise, AutoRL makes reinforcement learning more accessible and easier to deploy across different domains.
As a straightforward way of designing states, the exact observation from the environment is used as the state representation, and the agent's decision is directly sent to the environment as an action. The observations can be considered individually or stacked in batches. Although these states and actions might allow the agent to find an optimal or near-optimal policy for some tasks, they are not necessarily the best representations of states and actions. In other tasks, further processing of the environment is required to find a suitable state representation. As a simple example, normalization is usually necessary for the inputs of Neural Networks (NNs), and using raw observations of the environment may produce undesirable results [98]. NNs are strong function approximation tools for complex RL problems, and combining RL and NNs has been a breakthrough in Machine Learning (ML) during the last few years. Furthermore, the performance of an RL policy is highly dependent on the RL algorithm and its hyper-parameters. Selecting algorithms and tuning hyper-parameters are normally done using expert knowledge; even so, many iterations are still needed to find the best set of hyper-parameters. In recent years, RL and DRL have gained much attention in communities other than ML, such as chemical engineering [6] and the optimization community [72]. For instance, within the optimization community, which studies how to develop algorithms to solve combinatorial optimization problems, various (deep) RL methods have been applied to learn heuristics for problems in transportation, manufacturing, and logistics [72]. In these new RL applications, expert knowledge of RL might not be present. Hence, automating the RL procedure to make this approach accessible to non-experts and other research directions is of great importance.
Figure 1: Solid arrows show the standard RL pipeline, while dashed arrows depict the AutoRL outer loop where evaluation feedback is used to update configurations and iterate until a satisfactory setting is found.
Automated Reinforcement Learning (AutoRL) provides a framework to automatically make appropriate decisions about the settings of an RL procedure before learning starts. In other words, the RL components, including state, action, and reward definitions, algorithm selection, and hyper-parameter optimization, are determined through AutoRL, and the best configuration of each component is provided to an RL procedure to solve a task. Figure 1 shows an RL pipeline containing the RL components. To model and solve a problem using RL, the steps of this pipeline are followed, starting from MDP modeling. AutoRL aims to automate the different steps of this pipeline and reduce the need for expert knowledge. We use the term AutoRL to emphasize the resemblance to Automated Machine Learning (AutoML), a framework for automating supervised and unsupervised learning procedures. According to [49], an AutoML pipeline consists of the components of ML frameworks. It starts with data preprocessing, which covers data collection and data cleaning. The next step is feature engineering, including feature extraction, feature construction, and feature selection. After processing the data and defining the features, a model is developed and optimized to perform classification or clustering tasks. Finally, the quality of the model is determined through evaluation. In order to automate this procedure, the output of the evaluation is used to configure new settings, and the steps are followed again. In this way, several iterations might be needed to derive the optimal settings for data preprocessing, feature engineering, and model generation. Although some components of AutoML and AutoRL are similar, AutoML methods cannot necessarily be used in AutoRL because the configuration of the problems and the complexity of the evaluation step are different. The AutoRL pipeline defines and selects the components of RL automatically. In recent years, with the combination of Deep Learning and RL, the need for automating the components of DRL has increased, because many modeling decisions are made manually, and even RL experts have to test several configurations of state, action, reward, algorithm, and hyper-parameters to obtain the best definition. Furthermore, experiments show that there is normally a gap between the impressive results achieved by DRL algorithms in controlled environments and their practical applicability in real-world scenarios [50]. AutoRL can address this gap by encouraging researchers to focus on aspects like sample efficiency, stability, and transferability, fostering the development of DRL techniques that are more applicable and effective in practice.
In this paper, we review relevant work that can be included in an AutoRL framework and elaborate on research challenges and directions in this relatively new research area. For each of the components mentioned above, we present different approaches from the literature that might help automate the corresponding component. For example, approaches that modify the initial observations of the environment to define a state representation are candidate methods for automating states. The AutoRL pipeline consists of modeling a particular problem as a sequential decision-making problem and MDP, selecting an appropriate algorithm, and tuning hyper-parameters. These three steps, followed by evaluation, are illustrated in Figure 1. Evaluating an RL algorithm is normally performed by tracing the reward during training and comparing the final total reward with some baselines, which requires storing a sequence of rewards in memory for use in the evaluation phase.
Figure 2: A roadmap of Automated Reinforcement Learning (AutoRL)
To demonstrate the purpose of AutoRL, take a classical optimization problem, the Traveling Salesman Problem (TSP), as an example. The first step in solving TSP with DRL is to model the problem as a sequential decision-making problem and determine the MDP components. Vanilla TSP is defined as finding a tour with minimum length in a graph where all the nodes are visited exactly once. In a constructive solution approach, an agent starts from the source and walks through nodes until a tour is built. At each timestep, the graph, the current node, the visited nodes, and the remaining nodes are the observations of the environment. There are several ways, such as graph neural networks [91] and structure2vec [27], to convert the observations to a state representation, and AutoRL helps to find the best among the possible methods for this conversion. The second step is to define an RL algorithm for updating the policy network. In actor-critic methods, the policy network and the value network are the two main networks, and they are updated during training. This description of solving TSP is a specific example that reflects the majority of current approaches to using DRL for TSP. Popular RL algorithms such as A2C [75], PPO [92], ACER [116], and DQN [76] are available to train the policy network. Each of them is useful for particular problems, and it is not easy to determine the best algorithm for a given problem. AutoRL aims to provide the opportunity to search among these possible algorithms to find a suitable one. We focus on model-free RL algorithms, where the transition probabilities between states are unknown. The third step is to set the hyper-parameters, such as the network architecture, learning rate, and discount factor. AutoRL employs hyper-parameter optimization methods in its framework to derive the optimal hyper-parameters for an algorithm and a set of problem instances.
A broad exploration of AutoRL and relevant work is presented in [83]. The survey covers a wide range of AutoRL techniques, including algorithm selection, hyper-parameter tuning, and architecture design. While that survey and our paper share a common focus on AutoRL, they differ in their specific approaches and coverage. For instance, the survey provides a broad, unifying taxonomy that structures AutoRL around key components of the RL pipeline, such as hyper-parameter optimization (HPO), where it discusses RL-specific methods like population-based training as well as Bayesian optimization tailored for noisy RL environments. Additionally, it covers neural architecture search (NAS) adapted for RL policies and value functions, with examples including evolutionary NAS for continuous control tasks, and addresses automated algorithm configuration and selection across off-policy and on-policy methods. Furthermore, the survey examines reward function design through techniques like inverse RL and motivated rewards, while also exploring environment design via unsupervised curriculum generation. Overall, it encompasses both model-free and model-based RL paradigms, offering detailed discussions on challenges such as long training times, sample inefficiency, and evaluation in diverse domains like games, robotics, and molecular design. Spanning over 50 pages, the survey includes extensive references and case studies. Moreover, it highlights open problems, for example, creating standardized benchmarks for AutoRL, improving reproducibility, scaling to high-dimensional real-world applications, and integrating multi-fidelity optimization.
In contrast, our paper takes a more focused approach and provides a condensed overview of the key concepts in AutoRL, emphasizing model-free RL methods while largely omitting in-depth treatment of model-based aspects or specialized areas like automated reward and environment design. In addition, we extend our overview to AutoML, learning-to-learn, and automated neural network design techniques that can potentially be adopted for RL and assist RL designers in automating different RL components, offering insights into cross-domain adaptations not as prominently featured in the survey. Therefore, this paper serves as an overview that highlights future directions rather than a survey. While both papers contribute to the understanding of AutoRL, the extent of their coverage, the level of detail, and the specific aspects emphasized vary, making them complementary resources for those exploring automated reinforcement learning.
To provide a concise roadmap of the AutoRL landscape and illustrate how the subsequent sections relate to its key components, Figure 2 summarizes the major directions covered in this overview, including MDP modeling automation, algorithm/learner design, hyper-parameter optimization, evaluation, and extensions such as meta-learning, neural architecture search, and LLM-based AutoRL techniques. Compared with [83], which provides a broad survey and taxonomy of AutoRL methods across the full pipeline, our paper is positioned as a focused overview centered on model-free RL. We emphasize the core automation knobs that practitioners most frequently tune in model-free DRL, while summarizing practical challenges that affect reliable automation, such as evaluation sensitivity, seed variance, and the cost of repeated full training. We further highlight how ideas from AutoML, learning-to-learn, and neural architecture design can be adopted as reusable tools within the AutoRL outer loop, using this perspective to motivate future research directions rather than exhaustively surveying all subfields.
This paper is organized as follows. In Section 2, the works on automating the MDP modeling, including the definition of states, actions, and rewards, are reviewed. In Section 3, the process of RL algorithm selection is reviewed. Since algorithm selection is normally intertwined with hyper-parameter optimization, most of the combined work on algorithm selection and hyper-parameter optimization, together with other hyper-parameter optimization work, is presented in Section 4. Section 5 presents recent work in meta-learning that can be leveraged in an RL framework. In Section 6, previous work on optimizing and learning neural network architectures is reviewed. Section 7 explores the integration of large language models into AutoRL pipelines. Section 8 discusses limitations and future work in AutoRL. Finally, Section 10 concludes the paper.
2 Markov Decision Process Components
Formally speaking, a discrete-time finite-horizon MDP is a tuple $(S, A, r, T, \gamma)$, where $S$ is the state space, $A$ is the action space, $r: S \times A \rightarrow \mathbb{R}$ is the immediate reward denoting the benefit of transitioning from the current state $s_t \in S$ to the next state $s_{t+1} \in S$, $T$ is the transition probability, and $\gamma$ is the discount factor [86]. At each decision moment or discrete timestep $t$, an agent interacts with the environment, and its goal is to learn a policy $\pi: S \times A \rightarrow [0,1]$ that assigns a probability to each action. Following a greedy, $\epsilon$-greedy, softmax, or other action selection policy, the agent takes an action according to these probabilities and transitions to the next state. In other words, the agent observes a state $s_t \in S$ and performs an action $a_t \sim \pi(\cdot \mid s_t)$. Taking an action has two consequences. First, the agent receives a reward value $r_t$. Then, the state of the environment transitions from $s_t$ to a new state $s_{t+1}$ according to the transition probability $T$. The agent updates the policy during learning to find the optimal policy that yields the maximum total reward. RL provides an interaction-based framework to solve the MDP and learn the policy $\pi$. AutoRL needs to define the following four components of RL: state, action, reward, and transition probability. The transition probability is mostly unknown for model-free RL problems [104]. Hence, we focus only on state, action, and reward definitions in this section.
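As a concrete illustration of this interaction loop, the following minimal sketch pairs the formalism above with tabular Q-learning and $\epsilon$-greedy action selection. The gym-style environment interface (reset, step, sample_action) is a hypothetical assumption for illustration, not an interface defined in this paper.

```python
import numpy as np

# A minimal sketch of the MDP interaction loop with tabular Q-learning and
# an epsilon-greedy policy. The environment interface (reset/step/sample_action)
# is an assumed gym-style convention, not part of the paper.
def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection over the current Q estimates
            a = env.sample_action() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)  # reward r_t and next state s_{t+1}
            # one-step temporal-difference update toward r + gamma * max_a' Q(s', a')
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a])
            s = s_next
    return Q
```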
2.1 Methods for Automating States
In many classical RL tasks, such as mountain car, cart pole, and pendulum [119], the agent's raw observation is treated as the state. However, this observation is not always an efficient representation. In applications with continuous or high-dimensional spaces, the state space can be extremely large, making value function approximation challenging. As a result, learning effective mappings from observations to compact state representations has received significant attention. Existing approaches can be grouped into two categories: (1) methods that rely on expert-designed transformations of raw observations, where hyper-parameter tuning and configuration remain manual, and (2) methods that automatically learn state representations, reducing dependence on expert knowledge.
2.1.1 Transferring raw observations to state representation
Manipulating raw observations and constructing new features are widely used for deriving state representations. These methods range from simple approaches such as tile coding applied to linear function approximation methods for problems like the n-state random walk [94, 1], to more complex methods like structure2vec [27] and Pointer Networks [111] for graph Combinatorial Optimization Problems (COPs) such as Vertex Cover [60] and TSP [14]. They are mainly employed to expand the observation into a more useful representation, with or without taking the final policy into account. For problems like combinatorial optimization [71], robot navigation [7], and real-world business problems like train shunting [84] and online advertising [90, 4], the raw observation of the environment may require processing to derive a state representation.
Each state consists of an $n$-dimensional observation vector from the environment, and each entry is a scalar value called a feature. In other words, the observations in an $n$-dimensional observation space are feature vectors with $n$ entries. In more complex tasks like image classification, where the observation is a matrix rather than a vector, the matrix can be flattened into a vector. In some problems, the exact observations from the environment are not sufficient for representing states. As an example, assume a task with a 2-dimensional observation space where the two dimensions are interrelated: the appropriate actions when both dimensions are positive or both negative differ from those when the dimensions have different signs. In this case, using a vector with two entries as a state does not capture any interaction between the two variables. Since the observations in many environments are represented by numerical values, features can be combined to generate new meaningful features. One of the simplest families of features used for this purpose is polynomial features [104]. Polynomial features are obtained by modeling the observations as an order-$n$ polynomial. Formally speaking, assume $O=(o_1, o_2, \ldots, o_k)$ is an observation vector from the environment. The entry $s_i$ of the state representation corresponding to the observation is defined as:
$s_i = \prod_{j=1}^{k} o_j^{c_{i,j}}$ (1)
where $c_{i,j}$ is an integer denoting the degree of the $j^{th}$ observation term in the definition of the $i^{th}$ entry of the state representation. This approach is mainly used for deriving state representations for linear function approximation algorithms when important interactions between the features are not included in the observations of the environment.
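As an illustration of Equation (1), the short sketch below builds a polynomial state representation from an observation vector; the degree matrix $c_{i,j}$ shown is an arbitrary illustrative choice, not prescribed by the method.

```python
import numpy as np

# A small sketch of Equation (1): each state entry is a product of observation
# entries raised to integer degrees c[i, j]. The degree matrix is illustrative.
def polynomial_state(obs, degrees):
    """obs: length-k observation vector; degrees: (m, k) integer matrix."""
    return np.array([np.prod(obs ** c) for c in degrees])

obs = np.array([0.5, -2.0])
# order-2 polynomial features for a 2-d observation: 1, o1, o2, o1*o2, o1^2, o2^2
degrees = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 0], [0, 2]])
state = polynomial_state(obs, degrees)  # -> [1.0, 0.5, -2.0, -1.0, 0.25, 4.0]
```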
Figure 3: An example of coarse coding. The resulting feature vector of state $s$ is $(0,0,1,0,0,1)$.
Coarse Coding is another useful approach for generating features, especially when the observation of the environment is not informative enough [105]. For example, assume a task with a two-dimensional state space where each region of the space has its own characteristics. In order to capture pertinent information about the environment, coarse coding introduces overlapping circles whose status indicates the corresponding state of the observation. Each observation lies in one or more circles, and circles are called present/absent or active/inactive based on the location of the observation. If the observation lies in the $i^{th}$ circle, the corresponding entry of state $s$ is 1; otherwise, it is 0. Using this method, the feature vector of a state extends to $n$ binary values. Figure 3 shows an example of coarse coding with $n=6$.
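The following short sketch illustrates coarse coding under simple assumptions: each binary feature indicates whether the observation falls inside one of $n$ fixed circles; the circle centers and radius here are illustrative, not prescribed by [105].

```python
import numpy as np

# A minimal sketch of coarse coding: each feature is 1 if the observation
# falls inside the corresponding circle, else 0. Centers and radius are
# illustrative assumptions.
def coarse_code(obs, centers, radius):
    obs = np.asarray(obs)
    return np.array([1 if np.linalg.norm(obs - np.asarray(c)) <= radius else 0
                     for c in centers])

centers = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]  # n = 6 circles
features = coarse_code((1.9, 1.1), centers, radius=0.5)     # activates nearby circles
```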
Tile Coding [104] is a widely used approach for converting a continuous space to a discrete one, which is easier to manage and reduces the complexity of the problem. In tile coding, $n$ tilings, each with a fixed number of tiles, are offset from each other by a uniform amount in each direction. Figure 4 shows an example of tile coding.
Figure 4: An example of tile coding. The active tiles are shown in bold margins. The oval is the original observation space and the point is a sample observation.
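A minimal one-dimensional sketch of tile coding is given below, assuming uniformly offset tilings over a bounded range; the range, number of tilings, and tiles per tiling are illustrative hyper-parameters.

```python
import numpy as np

# A minimal sketch of tile coding for a 1-d continuous variable: each of the
# n tilings is shifted by a fraction of the tile width, and the active tile
# index in each tiling is recorded. Ranges and sizes are illustrative.
def tile_code(x, low=0.0, high=1.0, n_tilings=4, n_tiles=8):
    tile_width = (high - low) / n_tiles
    active = []
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings          # uniform offset per tiling
        idx = int((x - low + offset) // tile_width)  # active tile in this tiling
        active.append(min(max(idx, 0), n_tiles - 1)) # clip to valid range
    return active  # one active tile index per tiling

indices = tile_code(0.37)  # -> [2, 3, 3, 3]
```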
With the popularity of Deep Neural Networks (DNNs) in the past few years, feature engineering is mainly performed by DNNs, and researchers focus more on designing the DNNs. Nevertheless, for some problems like COPs, processing the raw representation of problem instances in order to derive effective state representation improves the quality of the solution. In [27], an approach named structure2vec for representing structured data like trees and graphs is introduced. This approach is based on the idea of embedding latent variable models into feature space. A vector for representing a graph is obtained by employing probabilistic kernels to find latent variable models, and a neural network is trained to output the embedding of a graph based on nodes’ attributes. This idea is used in [60] for solving graph-based combinatorial optimization problems such as Minimum Vertex Cover and TSP. The graph embedding network is learned by fitted Q-learning, and the output of the network is used as a greedy policy to incrementally create the solution to the problem.
2.1.2 Automatically defining the state representation
Some approaches in the literature try to automatically find state representation. Adaptive tile coding [118] begins with a single tile covering the entire state space and uses two heuristics to guide splitting. The first monitors the minimum Bellman error and triggers a split when improvement stalls. The second examines action-selection conflicts within a tile and splits it if conflicts exceed a threshold. To allow uneven splits, [67] employs a genetic algorithm to automatically learn tile codings and state abstractions for large state spaces. Starting from a single tile, the GA determines when and where to split. Each individual represents a tiling encoded as a binary decision tree, and its fitness is measured by the performance of an RL agent using that tiling. Mutation operators either shift existing split boundaries or introduce new splits. This method is demonstrated on the Mountain Car and Pole Balancing continuous control tasks [67].
A common strategy for large state spaces is state aggregation, which groups states with similar characteristics to reduce complexity. In [12], state aggregation for Q-learning in continuous domains is achieved by combining Growing Neural Gas (GNG) with Q-learning. GNG, an unsupervised method, incrementally builds a topological representation of the environment using a set of codeword vectors. Each vector represents a region defined by nearest-neighbor quantization, and GNG refines the representation by adjusting or adding units during training, thereby learning state abstractions automatically. Similarly, [5] extends tile coding to discretize continuous state spaces for solving the knapsack problem. A single n n-dimensional tiling converts continuous item values into discrete representations, where each item corresponds to a dimension. Reinforcement learning is then used to automatically determine the number of tiles per dimension. In this formulation, states correspond to items, and actions determine the number of tiles.
2.1.3 Challenges
Approaches for transforming observations into state representations have parameters and settings whose proper tuning can significantly increase the total reward. For instance, although tile coding and coarse coding are useful approaches for handling large and continuous state spaces, the numbers of tilings, tiles, and circles have to be determined by the system designer or an expert. Furthermore, promising NN-based methods like Pointer Networks require appropriate information about the problem instance, which is still the task of the RL designer. Therefore, completely replacing expert knowledge with automation is challenging when defining state representations. The other challenge is the generalization of the proposed methods. An automated method such as structure2vec is a powerful way to represent graphs, but this embedding is only useful for graph-based problems, while a large fraction of RL problems are not inherently graph-based. The same issue holds for adaptive tile coding, which requires special adaptations for a particular task. Deriving a generic state representation method would improve an AutoRL pipeline drastically, but this is not well studied in the literature.
2.2 Methods for Automating Actions
In many RL tasks, actions are the decisions of the agent that alter the state of the environment. Different types of actions, such as continuous, discrete, multi-dimensional, bounded, or unbounded, entail policies of different qualities. For example, an action like the price in a dynamic pricing task can be modeled with either a continuous or a discrete action space. On one hand, continuous actions might be more precise; however, they cannot be modeled with tabular reinforcement learning or with function approximation approaches like DQN that assign an output to each action. On the other hand, although discrete actions are easier to model, discretizing a continuous space might be tricky, especially when small changes in the action have a large impact on the total reward. Therefore, deriving a proper action representation is very important, as it is difficult to find the representation of actions that yields the best policy. For this reason, automating the definition of action spaces is necessary for many tasks. Action spaces can also combine discrete and continuous dimensions, as in multi-dimensional spaces like robot joints in robot navigation problems. For many continuous control tasks, such as the pendulum or BipedalWalker [18], the continuous action space can be discretized into discrete actions. In this subsection, we review learning actions and discretizing continuous action spaces separately.
2.2.1 Learning actions
Learning action representations in order to improve action values and policies has become popular in recent years. In [107], the action representation of multi-dimensional action spaces is learned using a hyper-graph. A hyper-graph is a generalization of a graph in which each hyper-edge can contain one or more vertices. In this modeling, the actions are vertices of the hyper-graph, and the goal is to learn the representation of the hyper-edges. To achieve this, a parametric function is defined for each hyper-edge whose input is a representation of the states and whose outputs are separate values for each possible action in the action space. If the action space is multi-dimensional, the number of outputs equals the cardinality of the Cartesian product of the action vertices in the hyper-edge. After receiving a state representation, each parametric function corresponding to a hyper-edge returns a vector for each action, and these vectors are combined by a non-parametric, fixed mixing function. The output of this mixing function is the Q value for RL. In order to find a good hyper-graph for this problem, a rank $r$ is chosen, defining the set of all hyper-edges with cardinality at most $r$. The desired hyper-graph is the one where $r$ equals the number of vertices.
An efficient way to represent actions is to model the output of the policy network as a continuous Probability Density Function (PDF). In common practice, a Gaussian distribution is used for the policy, and the mean and standard deviation are learned during training [51]. The Gaussian distribution has been successful in many tasks with continuous action spaces. However, the infinite support of this distribution might introduce bias in policies obtained from policy gradient algorithms. To solve this issue, the Beta distribution is used as the policy PDF instead of the Gaussian distribution in [24]. The authors show that using a Beta PDF for the policy reduces bias while performance is not negatively affected.
One approach to automatically derive the policy distribution is introduced in [108]. According to this work, the policy gradient update rule with parametric distribution functions results in sub-optimal policies. This sub-optimality is in the distribution space, and learning the policy distribution is a solution. For this purpose, distributional policy optimization (DPO) is presented as an update rule that minimizes the distance between the policy and a target distribution. This update rule is shown in Equation 2.
$\pi_{k+1} = \Gamma\left(\pi_k - \alpha_k \nabla_{\pi}\, d\left(\mathcal{D}^{\pi_k}_{\mathcal{I}^{\pi_k}}, \pi\right)\Big|_{\pi=\pi_k}\right)$ (2)
where $\Gamma$ is a projection operator onto the set of distributions, $d$ is a distance measure, and $\mathcal{D}^{\pi_k}_{\mathcal{I}^{\pi_k}}$ is a distribution over all states and actions whose advantage value is positive. In order to minimize the distance between the two distributions, an Implicit Quantile Network [26] is employed with the Wasserstein distance metric.
2.2.2 Discretizing continuous actions
Continuous action spaces are challenging in many tasks. As mentioned before, some RL algorithms like DQN do not work well for continuous action spaces because they rely on the $\epsilon$-greedy algorithm, and the best action is required at each step. Finding the best action in a continuous space requires an optimization step for each interaction with the environment, which is intractable. For discrete action spaces, a separate output of the policy network is assigned to each action, which is not possible when the action space is continuous, as it would need an infinite number of outputs. Although discretization [58] makes the action space discrete and manageable, it is not suitable for tasks that are very sensitive to small alterations of actions.
Sometimes, continuous action spaces can be transformed into discrete ones while retaining the information necessary for action selection, as shown in [106] for on-policy control RL when the domain of every continuous action dimension is between -1 and 1. The set of discrete actions for each dimension $i$ is $\mathcal{A}_i = \left\{\frac{2j}{K-1} - 1\right\}_{j=0}^{K-1}$, where $K$ is the number of discrete actions. The discrete policy is a neural network that outputs a logit $L_{ij}$ for the $j^{th}$ action in the $i^{th}$ dimension. For each dimension $i$, the logits are combined through a softmax function to compute the probability of choosing each action. This approach is integrated with TRPO and PPO and evaluated on MuJoCo benchmarks [106].
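The sketch below illustrates this per-dimension discretization: $K$ evenly spaced actions in $[-1, 1]$ per dimension, with a softmax over per-dimension logits. In practice the logits would come from the policy network; random values stand in here for illustration.

```python
import numpy as np

# A sketch of per-dimension action discretization: K evenly spaced actions in
# [-1, 1] per dimension, sampled from a softmax over that dimension's logits.
def discrete_actions(K):
    return np.array([2 * j / (K - 1) - 1 for j in range(K)])  # the set A_i

def sample_action(logits):
    """logits: (n_dims, K) array; returns one continuous value per dimension."""
    actions = discrete_actions(logits.shape[1])
    # softmax over each dimension's logits
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return np.array([np.random.choice(actions, p=p) for p in probs])

a = sample_action(np.random.randn(3, 11))  # 3-d action, K = 11 bins per dimension
```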
2.2.3 Challenges
Similar to state representation, approaches for determining the action space contain hyper-parameters, and finding appropriate configurations is important. One way to optimize these parameters is through the hyper-parameter optimization module, as depicted by a dashed arrow between actions and hyper-parameter optimization in Figure 1. This is challenging because there might be several candidate action spaces, each with its own parameters; hence, the search space is relatively large and the optimization procedure is computationally expensive. On the other hand, finding the optimal action space requires many trial-and-error steps over different definitions. Automation may help decide the policy distribution or the discretization approach for continuous actions, which is necessary. Designing the structure of the policy output is normally performed using expert knowledge, which is not always available. This is an interesting research direction that may positively influence the output of an RL framework.
2.3 Automated Reward Function
The reward function is a key component of an MDP that strongly affects policy quality. For example, in a 2-D grid world, different reward designs yield different behaviors. If only reaching the goal provides reward, the agent may wander since intermediate moves are not penalized. In contrast, a reward based on distance to the goal encourages efficient paths. Designing an effective reward function is therefore crucial and typically requires expert knowledge, often involving trial and error. Automating reward design can help agents discover better policies more efficiently. In general, three main approaches to reward design can be considered for automation, allowing the agent to search for the most suitable representation.
2.3.1 Curriculum Learning
Curriculum learning is useful for training in environments with sparse rewards. In many tasks, such as robot navigation, the search space is large and only the goal state yields a positive reward. Curriculum learning addresses this by starting training from states near the goal and gradually increasing difficulty as the initial states are placed farther away. In [40], a dynamic programming-based curriculum learning method is proposed to address reward sparsity. The approach begins training from states near the goal and continues until the agent demonstrates mastery. Then, new start states are generated via random walks from previously learned states, gradually expanding the search space. This so-called reverse learning strategy effectively mitigates sparse reward challenges. Similarly to [40], the backward learning method in [54] starts from states in the vicinity of the goal state and increases the distance once the agent demonstrates mastery in solving the problem. Unlike [40], where the state space expands by random walks, the new states are obtained by computing an approximate backward reachable set, which represents all points in the state space from which the agent can reach a certain region in a fixed, short amount of time.
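A conceptual sketch of this reverse curriculum idea, under simplifying assumptions, is given below; the environment helpers (reset_to, random_step), the training routine train_fn, and the mastery threshold are hypothetical placeholders rather than details from [40].

```python
import random

# A conceptual sketch of reverse curriculum generation: training starts from
# states near the goal, and new start states are produced by short random
# walks from already-mastered ones. All helpers are assumed placeholders:
# train_fn(env, starts) trains on the given start states and returns the
# success rate; env.reset_to / env.random_step move the simulator around.
def reverse_curriculum(env, goal_state, train_fn, n_rounds=10,
                       walk_len=5, n_new=20, mastery=0.8):
    starts = [goal_state]
    for _ in range(n_rounds):
        # train until the agent masters the current start-state distribution
        while train_fn(env, starts) < mastery:
            pass
        # expand the curriculum by random walks from mastered start states
        new_starts = []
        for _ in range(n_new):
            s = env.reset_to(random.choice(starts))
            for _ in range(walk_len):
                s = env.random_step()
            new_starts.append(s)
        starts = new_starts
    return starts
```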
Although curriculum learning helps with reward sparsity, its main application is in goal-searching tasks. Using curriculum learning in COPs is quite challenging: learning heuristics in these problems may not be amenable to curriculum learning because optimal solutions are not usually available in advance. Hence, curriculum learning in an RL pipeline is limited to goal-based tasks where the goal state is available.
2.3.2 Bootstrapping
Bootstrapping methods start learning from a pre-defined policy. This policy can come from a similar task or be designed by a human. A typical bootstrapping approach is introduced in [97], where the learning process is split into two phases. In the first phase, the robot is controlled by an existing control policy or directly by a human. In the latter case, the robot is navigated by humans in the environment, and it updates the value function during this phase without changing the policy. The second phase is a typical reinforcement learning process, and the robot updates the policy based on the value function initialized in the first phase. In [59], a hybrid EA-DRL algorithm is proposed to address reward sparsity, poor exploration, and unstable convergence. Each individual in the evolutionary population is a DNN actor within the Deep Deterministic Policy Gradient (DDPG) framework [66]. A population of actors is initialized, and each actor's fitness is defined as its cumulative reward. New generations are produced via crossover, which randomly exchanges weight segments between parents, and mutation, which adds Gaussian noise to network weights. Meanwhile, experiences collected by all actors are stored in a replay buffer and used to train a separate DDPG actor. [17] extends the evolutionary RL framework of [59] by introducing personal replay buffers and redesigned crossover and mutation operators. Each individual maintains a small genetic memory that stores its recent experiences. In the proposed crossover, called Q-filtered distillation, the child's genetic memory is populated with recent experiences from its parents. The child is initialized with one parent's weights and trained on this memory, optimizing a loss that blends the parents' policies. The new mutation operator, proximal mutation, refines Gaussian perturbation by scaling the noise with the summed gradients over a batch of transitions, reducing destructive updates and improving stability. [36] aims to tackle long-horizon RL tasks. A task is decomposed into simpler goal-reaching subtasks, where identifying subgoals is formulated as a shortest-path problem using a learned distance metric. By assigning a reward of $-1$ per step until the goal is reached, the Q-function aligns with shortest-path objectives.
2.3.3 Reward Shaping
Reward shaping involves designing a proxy reward function to maximize expected return. While rewards are often straightforward in tasks like video games, they require careful design in domains such as robot navigation [61] and multi-objective combinatorial optimization problems (COPs) [53]. In such cases, poorly chosen rewards can drastically affect performance, making careful reward design essential. The following sections review examples and applications of reward shaping across different domains.
In imitation learning, the agent cannot obtain any reward. State-action pairs from a target policy are demonstrated to the agent, and the goal is to mimic the target policy using these state-action pairs. In [57], a shaped reward function is provided to the agent, which is not necessarily aligned with the target policy. The policy is a parametric function with parameters updated by maximizing the reward. In [103], reward shaping is studied for Spoken Dialogue Systems (SDS) modeled as a POMDP. Because SDS typically provide only a final reward at the end of a dialogue, the reward signal is sparse. To address this, domain knowledge is used to generate an additional reward via a recurrent neural network (RNN) trained with supervised learning on annotated data.
A well-known reward shaping approach is potential-based reward shaping [78]. Let $r(s,a,s')$ be the immediate reward of taking action $a$ in state $s$ and moving to state $s'$. Potential-based reward shaping employs a function $F(s,a,s') = \gamma\Phi(s') - \Phi(s)$, where $\gamma$ is the discount factor and $\Phi(s)$ and $\Phi(s')$ are the potential functions of states $s$ and $s'$, respectively. The new reward function is $r(s,a,s') + F(s,a,s')$, the sum of the original reward and the shaping term. In [44], potential-based reward shaping is analyzed in episodic RL. The authors found that potential-based reward shaping preserves the optimal policy when the goal states are predefined terminal states and the shaped reward is zero in the goal states. They observe that policy invariance is violated for finite-horizon domains with multiple terminal states, and propose setting the potential value of terminal states to zero to solve this issue. Learning a potential function $\Phi(s)$ using meta-learning is studied in [132]. The potential function is defined by a neural network whose parameters are updated using the Model-Agnostic Meta-Learning (MAML) algorithm [39]; the network is trained by running an adapted version of DQN with replay memory. In [23], learned proxy reward functions are proposed for path-following and target-based robot navigation to address sparse binary rewards (i.e., whether the goal is reached or not). The method alternates between two stages: (1) optimizing the parameters of the reward function, and (2) training fixed-architecture actor-critic networks with the learned reward. The reward parameters that yield the highest objective value are selected, and RL is used to train the final actor-critic policy. In [37], the authors learn a parametric reward function for typical continuous control RL problems such as Ant, Walker2D, and Humanoid [19]. For each problem, a particular parametric reward function is defined, and the same algorithm is used to learn both the parameters of the reward and the policy network. Actor-critic algorithms, including Proximal Policy Optimization (PPO) [92] and Soft Actor-Critic (SAC) [48], are used with the parametric reward function, and the method outperforms the same algorithms without parametric reward on the aforementioned tasks. In [3], a reward vector is introduced to optimize reserve prices in real-time ad auctions with sparse binary feedback. The agent uses a policy network to map ad slot features to a distribution over reserve prices, while the environment only indicates whether the slot is sold. To provide a richer signal, the reserve price range is divided into weighted sub-intervals, and the reward is calculated as the inner product of the weight vector with an interval-based reward vector that activates the entry for the chosen interval.
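A minimal sketch of potential-based shaping is given below; the distance-to-goal potential is an illustrative choice for a grid world, not a function prescribed by [78].

```python
# A minimal sketch of potential-based reward shaping: the shaped reward adds
# F(s, a, s') = gamma * phi(s') - phi(s) to the environment reward.
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    return r + gamma * phi(s_next) - phi(s)

# Illustrative potential for a grid world: negative Manhattan distance to the
# goal, so moving closer to the goal yields a positive shaping term.
GOAL = (9, 9)
def phi(state):
    x, y = state
    return -(abs(GOAL[0] - x) + abs(GOAL[1] - y))

r_shaped = shaped_reward(0.0, (3, 4), (4, 4), phi)  # 0 + 0.99*(-10) - (-11) = 1.1
```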
2.3.4 Challenges
As mentioned before, curriculum learning is appropriate for goal-searching problems and fails for path-following problems where the goal state is unknown. Using this method in AutoRL is challenging because the goal state is often the solution to the optimization problem itself. However, the solutions of easy instances of a problem might help to solve complex instances; achieving this generalization is relatively hard but would be important progress in AutoRL. Bootstrapping methods require initial policies that are provided either by expert knowledge or by other learning or optimization methods. Though expert knowledge is helpful, it is not available for most problems. Besides, using any policy other than an optimal one can hamper learning, because it biases the value function and a sub-optimal policy is learned. Nevertheless, finding a suitable approach manually is time-consuming, and a level of automation could be largely beneficial. Reward shaping methods have parameters that are tuned before starting the training phase. Similar to parametric state and action representation methods, these hyper-parameter settings highly influence the total reward. Tuning them is challenging and time-consuming, which makes this an interesting research direction. Another interesting research direction concerns the effects of intrinsic rewards and whether they help solve the problem of sparse rewards [96], for example, giving a reward when the agent has many options available or when it encounters a new situation.
3 Automated Algorithm Selection
When the problem is modeled as a sequential decision-making problem and the components of the MDP are defined, the next step in solving it with RL is to select an appropriate algorithm. One way to reduce the search space is to filter the algorithms based on the class of the problem. For example, if the states and actions are discrete and finite, tabular RL algorithms like typical Q-Learning and SARSA [104] are suitable candidates, and there is no need to search over the class of sophisticated DRL algorithms. Moreover, if the model of the environment is known, a wide variety of model-based algorithms like dynamic programming can be utilized. Despite the different context, algorithm selection approaches in AutoML can provide insights for AutoRL [88]. Most of the work on RL algorithm selection is intertwined with hyper-parameter optimization. For this reason, we explain combined algorithm selection and hyper-parameter optimization methods in the next section and present the few works solely on algorithm selection in this section.
In [29], algorithm selection in supervised learning is modeled as a contextual multi-armed bandit problem. This approach is developed for AutoML; the difference from the RL setting is the use of a context vector that contains information about the dataset. Each decision moment starts with observing the dataset and its feature vector, which serves as the context vector of the contextual multi-armed bandit. Then, Upper Confidence Bound (UCB) and $\epsilon$-greedy algorithms are used for learning the values of the arms on a set of datasets. The algorithm selection problem in episodic RL tasks is modeled as a multi-armed bandit problem in [63] to decide which RL algorithm is in control for each episode. A set of RL algorithms is given, and the process of algorithm selection starts with an empty trajectory set. At each time step, an algorithm is selected, and it generates a trajectory with a discounted reward according to the policy of the selected algorithm. RL algorithms are selected based on Epochal Stochastic Bandit Algorithm Selection, in which the time scale is divided into epochs of exponential length. The policies of the algorithms are only updated at the start of epochs and are not changed during the epochs; this update scheme handles the non-stationarity induced by algorithm learning. In [110], Adaptive-Network (A-Net), an approach to enhance deep reinforcement learning by enabling agents to select targets dynamically, is introduced. A-Net learns an auxiliary task selection policy that adapts based on the current training phase, improving the sample efficiency and overall performance of reinforcement learning models. It addresses the challenge of target selection in multi-task environments, where choosing the right auxiliary task is critical for learning efficiency. The authors demonstrate that A-Net outperforms baseline methods across multiple environments, showing more effective and efficient learning behaviors.
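To make the bandit view concrete, the following sketch selects among candidate RL algorithms with UCB, in the spirit of (though much simpler than) the epoch-based scheme of [63]; run_episode is an assumed helper that trains the chosen algorithm's policy for one episode and returns its discounted return.

```python
import math

# A sketch of bandit-style RL algorithm selection: each arm is a candidate RL
# algorithm, and UCB picks which algorithm controls the next episode based on
# the returns it has produced so far. run_episode(algo) is an assumed helper.
def ucb_algorithm_selection(algos, run_episode, n_episodes=200, c=2.0):
    counts = {a: 0 for a in algos}
    means = {a: 0.0 for a in algos}
    for t in range(1, n_episodes + 1):
        untried = [a for a in algos if counts[a] == 0]
        if untried:
            a = untried[0]  # play each arm once before using UCB scores
        else:
            a = max(algos, key=lambda a: means[a] + c * math.sqrt(math.log(t) / counts[a]))
        reward = run_episode(a)
        counts[a] += 1
        means[a] += (reward - means[a]) / counts[a]  # running mean of returns
    return max(algos, key=lambda a: means[a])
```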
Challenges. Generally speaking, RL algorithms can be classified according to the type of policy or value function, which can be either tabular or parametric. One initial challenge in RL algorithm selection is to decide between these two. If the state and action spaces are relatively small, tabular methods are more appropriate, whereas they do not work for large and continuous state and action spaces. Discretization is another option, discussed in Section 2.2. After identifying an appropriate class of RL algorithms, selecting an algorithm to learn the policy is another challenge. One way is to treat the algorithm as a hyper-parameter and optimize it in the hyper-parameter optimization module, as is normally done in AutoML. However, this approach requires a dedicated hyper-parameter optimization framework because the quality of the algorithm depends on the problem, its MDP modeling, and the parameter settings. Furthermore, the number of required timesteps varies across algorithms, which makes comparing them challenging. In sum, selecting the proper RL algorithm for a task is difficult, and it highly depends on the problem.
4 Hyper-Parameter Optimization
An optimal RL configuration for solving a sequential decision-making problem highly depends on promising hyper-parameter settings [16]. Hyper-parameters are fixed during training, and they are usually set by RL experts prior to starting the interactions, although they might vary at different points in time [77]. For example, the learning rate in the policy gradient or value function update formula, the discount factor, the eligibility trace coefficient, and the parameters of a parametric reward shaping method are hyper-parameters. Different tasks require different sets of hyper-parameters, which makes hyper-parameter optimization challenging, and automating this process would be very useful. Many approaches developed for automatically optimizing the hyper-parameters of supervised learning algorithms could be adapted to RL. In this section, we first review hyper-parameter optimization approaches and then their applications. Finally, the main challenges of optimizing hyper-parameters are elaborated.
4.1 Methods for tuning hyper-parameters
This subsection presents previous hyper-parameter optimization work, categorized by core methodology.
4.1.1 Gradient Descent
Backpropagation is the main method for training neural networks, in which the gradient of the loss function is computed with respect to the weights. This gradient is propagated backward through the network, and the new weights are obtained by a variant of the gradient descent algorithm. The hyper-parameters of gradient descent with momentum, including the decay rate and learning rate, are included in the backpropagation algorithm and optimized together with the neural network weights in [70]. In [126], meta-gradient descent is used to automatically adapt hyper-parameters online; the algorithm is leveraged to self-tune the parameters of the actor-critic loss function.
Regression models are used in [99] to develop AutoRL-Sim, a simulation environment designed to address combinatorial optimization tasks such as the Traveling Salesman Problem (TSP), Asymmetric TSP (ATSP), and Sequential Ordering Problem (SOP). AutoRL-Sim automates reinforcement learning (RL) processes using AutoML techniques to optimize parameters like learning rate and discount factor, improving solution accuracy and efficiency. The simulator is built in R, is freely available, and supports post-experiment analysis with graphical outputs, offering users flexibility through both predefined and customizable modules.
4.1.2 Bayesian Optimization
Bayesian Optimization methods, including Sequential Model-based Algorithm Configuration (SMAC), have been very popular in AutoML [52]. These methods are beneficial for optimizing expensive-to-evaluate functions such as the performance of supervised learning algorithms. The idea of Bayesian Optimization is extended to RL hyper-parameter optimization in [11]. In this work, the RLOpt framework uses Gaussian process regression as a surrogate function and integrates Bayesian optimization with RL. The process of hyper-parameter optimization is modeled as a supervised learning problem where the parameters are the input and the performance is the target. At each step, the selected parameters are given to the agent, which learns a policy and returns the performance. Based on the history of (hyper-parameter, performance) tuples, the Gaussian process is used to select the next hyper-parameter settings. In [22], tuning the hyper-parameters of the AlphaGo method [95] using Bayesian Optimization is shown to improve playing strength. In [13], the authors develop a DRL approach to solve a multi-objective order batching problem, where the weights of the two objectives in the reward function are tuned with Bayesian Optimization.
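A condensed sketch of this surrogate-based loop is shown below, assuming an evaluate function that runs a full RL training for a configuration and returns its performance; the Gaussian process surrogate and UCB acquisition are one common instantiation, not necessarily the exact choices of [11].

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# A sketch of Bayesian optimization for RL hyper-parameters: a Gaussian
# process surrogate maps configurations to observed performance, and an
# upper-confidence acquisition picks the next configuration to evaluate.
# evaluate(config) is an assumed helper that runs a full RL training.
def bayes_opt(evaluate, bounds, n_init=5, n_iter=20, kappa=2.0):
    dim = len(bounds)
    rand = lambda n: np.random.uniform(bounds[:, 0], bounds[:, 1], size=(n, dim))
    X = rand(n_init)
    y = np.array([evaluate(x) for x in X])
    gp = GaussianProcessRegressor(normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        cand = rand(1000)                             # random candidate configs
        mu, sigma = gp.predict(cand, return_std=True)
        x_next = cand[np.argmax(mu + kappa * sigma)]  # UCB acquisition
        X = np.vstack([X, x_next])
        y = np.append(y, evaluate(x_next))
    return X[np.argmax(y)]                            # best configuration found

# Illustrative bounds, e.g. learning rate and discount factor ranges:
bounds = np.array([[1e-5, 1e-2], [0.9, 0.999]])
```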
4.1.3 Multi-Armed Bandit
A bandit-based hyper-parameter optimization algorithm named Hyperband is proposed in [64]. This algorithm is based on SuccessiveHalving [55], in which a set of $n$ hyper-parameter configurations is searched: at each iteration, all configurations are evaluated, and the worst half is removed from further processing. This continues until only one configuration remains. Hyperband employs a bandit-based strategy to efficiently allocate resources during the optimization process, combining random sampling with successive halving to explore the hyper-parameter space. It starts by allocating resources to a large number of configurations, then discards poorly performing configurations and allocates more resources to promising ones. This process continues iteratively until a single configuration is identified as the best. In [82], a method is introduced for hyper-parameter optimization using population-based bandits, ensuring provable efficiency in an online setting. By combining bandit algorithms with a population-based approach, the method dynamically adapts to changing optimization landscapes, demonstrating improved efficiency in hyper-parameter tuning.
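The core SuccessiveHalving routine underlying Hyperband can be sketched in a few lines, assuming an evaluate(config, budget) helper that trains with the given budget and returns a score:

```python
import numpy as np

# A sketch of SuccessiveHalving [55]: all configurations are trained with a
# small budget, the worst half is discarded, and the budget doubles for the
# survivors. evaluate(config, budget) is an assumed helper.
def successive_halving(configs, evaluate, min_budget=1):
    budget = min_budget
    while len(configs) > 1:
        scores = [evaluate(c, budget) for c in configs]
        order = np.argsort(scores)[::-1]                       # best first
        configs = [configs[i] for i in order[:max(1, len(configs) // 2)]]
        budget *= 2                                            # more resource for survivors
    return configs[0]
```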
4.1.4 Evolutionary Algorithms
In [38], the parameters of SARSA($\lambda$) and Q($\lambda$), two RL algorithms based on eligibility traces, are optimized using a Genetic Algorithm (GA). In this method, a vector containing all hyper-parameters is treated as a chromosome, and mutation and crossover are performed on this vector. The algorithm is tested on an under-actuated pendulum swing-up task, and the authors show that the selected parameters maximize the end performance. Another application of GA for optimizing the hyper-parameters of RL algorithms is presented in [93], where the parameters of Deep Deterministic Policy Gradient (DDPG) with Hindsight Experience Replay (HER) [9] are learned through a GA. The target parameters for the GA are the discount factor; the polyak-averaging coefficient, which is used for updating target networks in algorithms like DQN that keep separate target and main networks; the learning rates of the actor and critic networks; the percentage of time a random action is taken; and the Gaussian noise parameters. The concatenation of the binary representations of these parameters builds the chromosome, and the fitness is the inverse of the number of epochs needed to get close to the maximum success rate for a particular task.
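The following compact sketch shows the general GA loop for hyper-parameter vectors, with one-point crossover and Gaussian mutation; the fitness function, the encoding of genes in $[0, 1]$, and the operator details are illustrative simplifications of [38, 93].

```python
import random

# A compact sketch of GA-based hyper-parameter search: each chromosome is a
# vector of hyper-parameters (encoded in [0, 1] here), fitness is the RL
# performance, and new generations come from crossover and mutation.
# fitness(params) is an assumed helper; at least two parameters are assumed.
def ga_search(fitness, n_params, pop_size=20, generations=30, sigma=0.1):
    pop = [[random.random() for _ in range(n_params)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[:pop_size // 2]             # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = random.sample(parents, 2)
            cut = random.randrange(1, n_params)      # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [min(1.0, max(0.0, g + random.gauss(0, sigma)))
                     for g in child]                 # Gaussian mutation, clipped
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```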
There are typically three challenges in the hyper-parameter optimization of DRL. First, dynamic environments require dynamic hyper-parameters, and an optimal hyper-parameter setting in one stage might work poorly in another. Second, the optimization is not sample-efficient, as testing each selected hyper-parameter setting needs a full training run. Third, dynamic modification of the neural network is not considered in the literature. A joint optimization approach based on evolutionary algorithms that optimizes the agent's network and its hyper-parameters simultaneously is presented in [42]. In this evolutionary framework, each individual is a DRL agent consisting of a policy and a value network together with the RL algorithm's parameters. Rollouts of each individual are stored in a shared replay memory to be used as experiences for other agents. Each agent is evaluated by running for at least one episode in the environment, and its mean reward is used as the fitness measure. After crossover and mutation, all agents are trained using the experiences in the shared replay memory. This approach is applied to the TD3 algorithm on the MuJoCo continuous control benchmark. This integration of evolutionary algorithms and neural networks is known as neuroevolution [102].
4.1.5 Greedy Algorithms
In order to optimize the decay rate in algorithms based on eligibility traces, such as TD($\lambda$), a greedy algorithm is proposed in [117]. This algorithm defines $\lambda$ as a function of states for RL algorithms with linear function approximation. In each iteration of the policy evaluation algorithm, the agent greedily selects the value of $\lambda$ according to the weight vector, the observation vectors of the current and next states, the immediate reward, and importance sampling. The intuition behind this greedy algorithm is to minimize an error function defined as the difference between the return obtained with the selected $\lambda$ and the Monte Carlo return ($\lambda=1$).
4.1.6 Reinforcement Learning
Hyper-parameter optimization is modeled as a sequential decision-making problem in [56], and RL is used to find the optimal hyper-parameters. In this framework, the agent learns to explore the hyper-parameter space of a supervised learning algorithm, and the final parameters minimize the error on the validation set. The method was originally designed for supervised learning and operates on training and validation datasets; however, the general idea can be extended to RL algorithms. In the MDP modeling of the hyper-parameter optimization problem, the state of the environment is defined as the meta-features of an input dataset plus the history of evaluated hyper-parameters together with their performance; the action is a value for each hyper-parameter; and the reward is the performance of the ML algorithm on the input dataset with the selected hyper-parameters. Given this MDP, the deep Q-network (DQN) algorithm [76] is used to learn the parameters of an LSTM network, and the resulting policy acts as a decision-maker that determines the optimal hyper-parameters for each dataset.
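The sketch below illustrates this formulation in a deliberately simplified form: a tabular $\epsilon$-greedy agent chooses among a small grid of candidate learning rates and receives the resulting performance as its reward. The candidate grid, the `run_ml_algorithm` stub, and the tabular agent are simplifying assumptions (the paper uses a DQN over an LSTM encoding of the evaluation history).

```python
# A minimal sketch of hyper-parameter optimization posed as an RL problem:
# actions are hyper-parameter settings, the reward is the resulting score.
import random

CANDIDATES = [1e-4, 1e-3, 1e-2, 1e-1]     # action space: learning rates

def run_ml_algorithm(lr):
    # Placeholder: validation performance with the chosen hyper-parameter.
    return 1.0 - abs(lr - 1e-3) * 5

Q = {a: 0.0 for a in CANDIDATES}
random.seed(0)
for step in range(200):
    # epsilon-greedy over the hyper-parameter actions
    if random.random() < 0.1:
        action = random.choice(CANDIDATES)
    else:
        action = max(Q, key=Q.get)
    reward = run_ml_algorithm(action)     # performance acts as the reward
    Q[action] += 0.1 * (reward - Q[action])
print(max(Q, key=Q.get))
```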
Normally, the hyper-parameters of an algorithm are optimized once and kept fixed during the entire run of the algorithm. However, because most AI algorithms are iterative, the optimal hyper-parameters might change over time. The problem of dynamic algorithm configuration is studied in [15], where RL is used to derive a policy for the optimal configuration at each step. In this modeling, states are descriptions of an algorithm A, and actions assign particular values to the hyper-parameters of A. The reward function depends on the instances drawn from the same contextual MDP. The optimal policy is obtained either by tabular Q-learning or by DQN to select the configuration with the highest discounted reward. Since instance information is part of the state, the optimal configuration might differ between instances.
4.1.7 Neural Networks
One main challenge of well-known hyper-parameter optimization methods in AutoML, such as sequential model-based optimization (SMBO) and SMAC, is the time needed to perform the necessary iterations. These iterative algorithms take a considerable amount of time to optimize the hyper-parameters, which is highly prohibitive in the RL setting. This challenge motivates the development of a neural network that learns a mapping from data to hyper-parameters [20]. The meta-features of the dataset are the input of a Convolutional Neural Network (CNN), and the hyper-parameters of the algorithm are the output. The CNN is trained with supervised learning, using subsets of a large dataset as training data. The target hyper-parameters used for supervised learning are obtained by Bayesian Optimization.
In [31], hyper-parameter optimization for object tracking algorithms is modeled as an RL problem, and Normalized Advantage Functions [45] are used to learn a policy network that receives a state and returns the optimal hyper-parameter settings for a particular object tracking algorithm. Most of these algorithms produce a heat map over the search region showing the probable locations of objects. The combination of this heat map, the parameters of the object tracking algorithm, and appearance features such as RGB color constitutes the state, and the reward is the tracking accuracy.
One important research direction in robotics is domain randomization, a technique that uses a simulation model to provide a policy for a real environment. In other words, when training in a real environment is not practical, domain randomization helps to derive a policy that maximizes the total expected return over a set of MDPs drawn from the same distribution. The parameters of this MDP distribution are usually fixed and predefined; however, fixed parameters might not be sufficient for some environments. For this reason, the cross-entropy method is used to learn these parameters in [112]. In this method, the policy parameters are a function of the MDP distribution parameters, and the optimal policy parameters are derived by PPO. The MDP parameters are obtained by maximizing the discounted return of following the optimal policy in the real environment.
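A minimal sketch of cross-entropy search over the parameters of an MDP distribution, in the spirit of [112], is given below; `real_world_return` is a hypothetical stand-in for evaluating the policy trained under the sampled randomization, and the Gaussian parametrization of the distribution is an assumption.

```python
# A minimal sketch of the cross-entropy method over MDP distribution
# parameters (e.g., friction and mass scaling of a simulator).
import numpy as np

def real_world_return(phi):
    # Placeholder: discounted return of the policy trained with
    # randomization parameters `phi`, evaluated in the real environment.
    return -np.sum((phi - np.array([0.5, 2.0])) ** 2)

rng = np.random.default_rng(0)
mu, sigma = np.zeros(2), np.ones(2) * 2.0
for _ in range(30):
    phi = rng.normal(mu, sigma, size=(50, 2))           # sample candidates
    scores = np.array([real_world_return(p) for p in phi])
    elite = phi[np.argsort(scores)[-10:]]               # keep the top 20%
    mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
print(mu)  # learned MDP distribution parameters
```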
Applying RL in different domains requires special consideration when tuning the hyper-parameters. For instance, in [81], RL is leveraged to solve the Sequential Ordering Problem (SOP), a variant of the TSP with a precedence constraint. Parameter tuning in this work is performed by testing different RL algorithms, including SARSA and Q-learning, different reward definitions, and several values of $\epsilon$ in $\epsilon$-greedy. The resulting configuration of the algorithm is used to solve the SOP.
Challenges. Although several methods have been developed to tune hyper-parameters, hyper-parameter optimization for RL algorithms can be computationally expensive because the agent needs to interact with an environment and continuously update its policy for a number of timesteps or until convergence. This is usually a time-consuming and often intractable process that needs special consideration. Efficient methods from AutoML, such as Bayesian Optimization, may work well for RL models with small state and action spaces and short trajectories. However, for complex tasks with large state and action spaces or long trajectories, existing methods require considerable adaptation to provide the best configuration in a reasonable time. This remains a significant challenge in RL hyper-parameter optimization.
One main difference between supervised learning and reinforcement learning lies in the evaluation criteria: there is no target value, as in supervised learning datasets, to indicate the desired behavior. Instead, the agent must seek a compromise between exploration and exploitation, with the reward signal serving only as a hint [11].
Multi-fidelity methods [41] do not run the ML algorithm for the full budget for every hyper-parameter setting but only for a limited budget (low fidelity); only promising hyper-parameters are then run for longer (high fidelity). Adapting this idea to AutoRL is both challenging and interesting, as it might reduce the time budget required for optimization.
5 Learning to Learn
Apart from the three main components of an RL framework, the literature also presents levels of automation for procedures that span more than one component. For example, gradient descent is used for updating the parameters of parametric functions such as the policy or the reward. In this section, recent work on automating these kinds of procedures is reviewed. Most of these works were originally developed for supervised learning; however, the same motivations and requirements hold for RL, which points to interesting research directions for future work.
In [8], the standard gradient descent formula is replaced with a new update rule in which a learned function of the gradient is used rather than the gradient itself. The new update equation is shown in Equation (3).
$$\theta_{t+1}=\theta_{t}+g_{t}(\nabla f(\theta_{t}),\phi) \qquad (3)$$
where $\theta$ is the parameter of the objective function $f(\theta)$. In this formulation, $g_{t}(\nabla f(\theta_{t}),\phi)$ is a function of the gradient of $f$ with parameters $\phi$, obtained by a recurrent neural network. As shown in Figure 5, the method consists of two neural networks. The function $f$, known as the optimizee, is represented by a feed-forward neural network with parameters $\theta$. The gradients of $f$ are fed into the function $g$, which is represented by an LSTM recurrent neural network. These gradients plus the hidden state of the RNN are the input, and the output of $g$ is used in the update rule of Equation (3). The method is tested on a class of 10-dimensional quadratic functions as well as on the MNIST and CIFAR-10 datasets, and the results show the benefit of representing $g$ by an RNN.
Figure 5: Gradient descent using an RNN
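To make the data flow of Equation (3) concrete, the sketch below applies the update rule with a tiny recurrent cell standing in for the trained LSTM of [8]. The cell's weights are random and untrained here, so the example demonstrates only the mechanics of the method, not its learned behavior.

```python
# A minimal sketch of the learned-optimizer update: the optimizee's gradient
# passes through a recurrent function g whose hidden state carries history.
import numpy as np

rng = np.random.default_rng(0)
W_g, W_h = rng.normal(0, 0.1, (4, 2)), rng.normal(0, 0.1, (4, 4))
w_out = rng.normal(0, 0.1, 4)

def g(grad, hidden):
    # Recurrent update: the hidden state accumulates optimization history.
    hidden = np.tanh(W_g @ np.array([grad, 1.0]) + W_h @ hidden)
    return w_out @ hidden, hidden

def f(theta):                      # optimizee: a simple quadratic
    return (theta - 3.0) ** 2

theta, hidden = 0.0, np.zeros(4)
for t in range(100):
    grad = 2 * (theta - 3.0)       # gradient of f at theta_t
    step, hidden = g(grad, hidden) # g_t(grad f(theta_t), phi)
    theta = theta + step           # theta_{t+1} = theta_t + g_t(...)
print(theta, f(theta))
```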
Selecting RL algorithms is normally performed with expert knowledge or using the approaches explained in previous sections. Instead of using existing RL algorithms, each of which perhaps works well on a certain type of problem, a model is introduced in [115] that can learn an RL algorithm. Specifically, a distribution $D$ over MDPs is defined, and an RL algorithm is learned in the sense that it performs well on MDPs drawn from $D$. While solving an MDP, an RNN is trained whose inputs are the states, actions, and rewards of the MDP and whose output is the policy. The recurrent neural network therefore acts as the RL algorithm.
Neural Networks are mainly trained using a variant of Stochastic Gradient Descent (SGD) such as normal SGD, SGD with momentum, or Adam. The performance of these algorithms depends on the selected learning rate, which would be different for different contexts and applications. An automatic framework based on RL for deriving the best learning rate is proposed in [28]. In this approach, a set of features is introduced to represent the states in the RL modeling. These features include the variance and the gradient of the loss function. The state representation is used to train a policy using the Relative Entropy Policy Search (REPS) algorithm [85] for deciding the learning rate of a particular optimizer. REPS ensures the policy updates are close to each other by constraining the updates through a bound on Kullback-Leibler (KL) divergence.
A general meta-learning approach that can be applied to any model learned by gradient descent is presented in [39]. The goal of this approach is to update the model's parameters with a few training steps so as to produce acceptable results on a new task. For this purpose, a parametric model is defined that aims to work well on tasks drawn from a task distribution. The algorithm starts with a random initialization of the model parameters. In each iteration, a set of tasks is sampled from the given distribution, and task-specific adapted parameters are computed by gradient descent on a small number of examples. At the end of each iteration, the model's parameters are updated using the adapted parameters. The paper discusses the application of this method to supervised learning and classification, and a possible extension to RL is also explained. In [80], a meta-learning approach is presented that discovers an entire update rule for reinforcement learning algorithms based on a set of environments.
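The following sketch shows the inner/outer loop of this approach on a toy family of linear regression tasks, using the first-order approximation of the meta-gradient; the task family, the one-parameter model, and the learning rates are illustrative assumptions chosen to keep the example one-dimensional.

```python
# A minimal first-order sketch of the meta-learning loop of [39] on tasks
# of the form y = a * x, where each task has its own coefficient a.
import numpy as np

rng = np.random.default_rng(0)

def task_loss_grad(theta, a, x):
    # d/dtheta of the mean squared error between theta*x and a*x
    return np.mean(2 * (theta * x - a * x) * x)

theta, inner_lr, outer_lr = 1.0, 0.05, 0.01
for _ in range(1000):
    meta_grad = 0.0
    for _ in range(5):                       # sample a batch of tasks
        a = rng.uniform(-2, 2)
        x = rng.normal(size=10)
        # inner loop: one gradient step of task-specific adaptation
        adapted = theta - inner_lr * task_loss_grad(theta, a, x)
        # outer loop: first-order meta-gradient at the adapted parameters
        meta_grad += task_loss_grad(adapted, a, rng.normal(size=10))
    theta -= outer_lr * meta_grad / 5
print(theta)   # an initialization that adapts quickly across tasks
```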
In [35], a reset policy is learned alongside the reinforcement learning policy to return the environment to a safe state before it enters a situation that is expensive to reset. For example, an autonomous car that crashes at high speed would require the environment to be reset manually at great cost. Based on this example, a reset policy is necessary for some tasks to decrease the number of manual resets. In [35], an off-policy actor-critic method is used to learn the policies, where the $Q$-values of the main policy and the reset policy are jointly learned. The reset policy takes over and aborts the episode if its $Q$-value for the action proposed by the forward policy is lower than a threshold. Safe actions are reversible sequences of actions that the agent can always undo.
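The abort mechanism can be sketched as follows under stated assumptions: the $Q$-function, both policies, and the environment dynamics are stubs invented here so that the interplay between forward and reset policies is visible.

```python
# A minimal sketch of the reset mechanism in [35]: before executing the
# forward policy's action, the reset policy's Q-value for that action is
# checked, and the reset policy takes over if the value falls below a
# threshold.
Q_RESET_THRESHOLD = 0.25

def q_reset(state, action):
    # Placeholder: learned Q-value of the reset policy; here it decays as
    # the state drifts toward the expensive-to-reset boundary at |s| = 9.
    return 1.0 - 0.1 * abs(state)

def forward_action(state):
    return 1            # stub forward policy: always drift right

def reset_action(state):
    return -1           # stub reset policy: step back toward the start

state = 0
for _ in range(50):
    action = forward_action(state)
    if q_reset(state, action) < Q_RESET_THRESHOLD:
        action = reset_action(state)   # abort: the reset policy takes over
    state += action                     # stand-in for env.step(action)
    if abs(state) >= 9:                 # manual-reset region reached
        break
print(state)  # stays short of the boundary, avoiding a manual reset
```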
Meta-gradient descent has recently been explored for defining components of reinforcement learning. In [121], meta-gradient descent is used to discover a customized objective function parametrized by a neural network. In [122], gradient-based meta-learning helps to define the structure of the return and the rates at which it is discounted. In [62], a meta-learning approach leverages the experiences of many complex agents to learn a low-complexity neural objective function that determines how future individuals learn. RL environments provide a framework for training, testing, and evaluating RL agents. Different algorithms behave differently in various environments, and modeling an environment and its settings is of high importance in RL design. In [30], the generation of complex tasks through unsupervised environment design is explored. The authors propose a method for creating environments that foster the emergence of diverse skills without explicit supervision, facilitating effective knowledge transfer across tasks.
In [100], the proposed model integrates transfer learning (TL) with automated reinforcement learning (AutoRL) to improve performance on combinatorial optimization problems such as the Asymmetric Traveling Salesman Problem (ATSP) and the Sequential Ordering Problem (SOP). The authors introduce a new algorithm called Auto_TL_RL, which combines AutoRL and TL methodologies to enable the transfer of knowledge between tasks, significantly reducing computational time and improving solution quality. In this paper, AutoRL is used to automatically select the best configurations for reinforcement learning parameters, such as the learning rate and discount factor, simplifying RL experiments by reducing manual intervention in parameter tuning.
Challenges. Learning-to-learn for RL is still challenging in practice because it must generalize across tasks while remaining stable and affordable to train. Meta-objectives can easily overfit to a narrow task distribution and transfer poorly when tasks, dynamics, or reward scales shift. Meta-training is also computationally expensive, since each outer-loop update requires many inner-loop rollouts and policy updates, and the cost grows quickly with long horizons and high-dimensional observations. Moreover, the inherent stochasticity of RL can make meta-gradients noisy and unstable, so multi-seed evaluation and consistent experimental protocols are often necessary to avoid selecting brittle meta-initializations.
6 Automating Neural Network Architecture
Combining DNNs and RL has produced several successful algorithms for solving complex problems such as COPs and video games. Although using DNNs improves the quality of function approximation, the performance of DRL algorithms depends highly on a proper DNN structure. Different methods have been proposed in the literature for automatically determining the best DNN structure. These methods can be categorized as hyper-parameter optimization; however, we devote a separate section to them to emphasize their importance.
In [130], a recurrent neural network - the controller - is trained with reinforcement learning where the outputs of this RNN determine the architecture of another neural network - the child network - that is used for prediction. The child network is configured by the controller, and it is trained using a dataset. The obtained performance is used as a reward for training the controller. This approach was originally developed for supervised learning, although deriving the optimal architecture of a neural network can be helpful for other domains.
According to [131], applying the method presented in [130] directly on large datasets is computationally expensive. The solution introduced in [131] is to search on a proxy dataset, which is rather small, and then transfer the learned network architecture to a large dataset. The search process is the same as [130], where an RNN provides the architecture of the child network. The search space in this work contains generic convolutional cells that are expressed in terms of repeated motifs in various CNNs, like a combination of convolutional filter banks. Two types of convolutional cells are introduced with feature maps of the same dimension and half size, respectively. The final CNN architecture is a combination of these cells. The controller receives the output of previous cells and generates the next architecture of the final CNN.
Deriving a DNN architecture can be modeled as a sequential decision-making problem, and RL is a suitable approach for solving it. In [10], a meta-modeling algorithm based on RL, named MetaQNN, is introduced to generate CNN architectures. The process of CNN architecture selection is automated by a Q-learning agent whose goal is to find the best CNN architecture for a particular machine-learning task. The validation accuracy on the given ML dataset is used as the reward for the agent, and actions are selected by following the $\epsilon$-greedy algorithm, exploring a discrete and finite space of layer parameters. This approach shows high performance on image classification tasks. RL is also used as a neural architecture search technique for RL itself in [73].
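A simplified sketch of this idea follows: a tabular Q-learning agent selects a fixed-depth sequence of layers, and the (stubbed) validation accuracy of the assembled network is propagated back as a terminal reward. The layer vocabulary, the depth, and the `train_child_network` stub are assumptions, not MetaQNN's exact search space.

```python
# A minimal sketch of MetaQNN-style architecture search [10]: a Q-learning
# agent picks layers sequentially; validation accuracy is the reward.
import random

LAYERS = ["conv3x3", "conv5x5", "pool", "fc"]
DEPTH = 3

def train_child_network(arch):
    # Placeholder: train the sampled CNN and return validation accuracy.
    return 0.5 + 0.1 * arch.count("conv3x3") + random.gauss(0, 0.01)

Q = {}   # Q[(depth, layer)] -> value
random.seed(0)
eps = 1.0
for episode in range(300):
    arch = []
    for d in range(DEPTH):
        if random.random() < eps:                     # epsilon-greedy choice
            layer = random.choice(LAYERS)
        else:
            layer = max(LAYERS, key=lambda a: Q.get((d, a), 0.0))
        arch.append(layer)
    reward = train_child_network(arch)
    for d, layer in enumerate(arch):                  # terminal-reward update
        key = (d, layer)
        Q[key] = Q.get(key, 0.0) + 0.1 * (reward - Q.get(key, 0.0))
    eps = max(0.1, eps * 0.99)                        # anneal exploration
best = [max(LAYERS, key=lambda a: Q.get((d, a), 0.0)) for d in range(DEPTH)]
print(best)
```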
Optimizing the parameters of NNs is an interesting research area, as those parameters greatly influence performance. In [120], a DRL approach is presented for automatically learning the learning rate of stochastic gradient descent. Given the model parameters of the neural network and the training samples, the authors use an actor-critic policy gradient method whose policy network selects a learning rate for the gradient descent algorithm. The state in this modeling is a compact vector of the model parameters, which avoids processing all the parameters of large networks. The immediate reward for updating the learning-rate generator network is the difference between the loss of the main model in two consecutive time steps. According to the presented results, automatically deriving the learning rate increases the quality of the prediction model.
In [43], the authors focus on lifelong learning and how an agent learns the optimal policy of a particular MDP using information from a sequence of MDPs drawn from the same distribution. Specifically, the problem in this work is to search for an optimal exploration policy that an agent follows while exploring the environment. Each agent maintains two policies: an exploitation policy, which is task-specific, and an exploration policy, which is shared across all MDPs drawn from the same distribution. At each time step, each of the two policies proposes an action, and the executed action is determined by an $\epsilon$-greedy rule. The exploration policy receives the same reward as the exploitation policy, and a variant of the policy gradient algorithm (REINFORCE or PPO) is used to update it. This approach is evaluated on typical RL problem classes, such as pole balancing.
Deep neural networks need a huge amount of computation for training, and this computational complexity is sometimes prohibitive. One approach to reducing the intensity of computation is quantization: reducing the bitwidth of operations, which can be applied per layer of a neural network. As the accuracy-preserving bitwidth may vary across layers, the problem of learning the optimal bitwidth is explored in [34]. In this work, a DRL approach is proposed to determine the bitwidth of each layer. The states comprise static information about the layers and dynamic information about the network structure during RL training. The actions set the bitwidth of each layer; the action space is flexible, and the agent can change the quantization of each layer from any bitwidth to any other. The reward reflects the accuracy and a measure of memory and computation cost; in effect, the two objectives of the reward function are preserving accuracy and minimizing bitwidth. Using these definitions of state, action, and reward, the PPO algorithm is used to learn a policy that derives the bitwidth of each layer of the neural network.
Components, topologies, and hyper-parameters of neural networks are automatically determined in [74] using a neuroevolutionary algorithm in which the neural networks are trained through evolution rather than gradient descent; this is a potential approach for determining the configuration of neural networks in deep RL algorithms. The method, called DeepNEAT, starts with an initial population of DNNs of minimal complexity, and new nodes and edges are added to each chromosome through mutation. Each chromosome contains a number of nodes, and each node is a layer of the neural network together with its hyper-parameters. To obtain the fitness of a chromosome, it is converted to a DNN, which is trained for a fixed number of epochs. DeepNEAT is extended to CoDeepNEAT, which decomposes complicated DNN structures into multiple repeated modules. Specifically, CoDeepNEAT maintains two populations, of modules and of blueprints. Each blueprint chromosome contains pointers to particular modules, and each module chromosome represents a small DNN. During fitness evaluation, the small DNNs corresponding to the pointers of a blueprint are combined to build a large DNN. This approach is evaluated on an image captioning task.
Challenges. Automating neural network architectures in deep reinforcement learning is challenging because the search is large and ill-conditioned, evaluations are noisy and sensitive to implementation details, and each candidate often requires expensive full RL training. Speedup methods such as weight sharing and proxy evaluation can reduce cost but may correlate weakly with final performance. Practical deployment further adds hardware constraints such as latency, memory, and energy, turning the problem into a multi-objective trade-off among accuracy, cost, robustness, and generalization across tasks and environments.
7 Large Language Model for AutoRL
Large Language Models (LLMs) are emerging as a practical interface between unstructured knowledge sources, including text, logs, and human intent, and reinforcement learning (RL) pipelines. From an AutoRL perspective, LLMs are particularly valuable when they reduce manual design effort across the RL stack: not only in reward shaping but also in algorithm and update-rule design, MDP abstraction, and policy improvement through language-level memory and feedback. In this section, we review RL-centric integrations of LLMs that directly support AutoRL components (as illustrated in Figure 1), including reward design, RL algorithm evolution, MDP automation, and LLM-augmented policy learning.
7.1 LLM for Reward Design
Reward design remains a central bottleneck in RL, particularly in scenarios involving sparse feedback and long-horizon credit assignment. Recent work indicates that LLMs can contribute several RL-grounded capabilities to automate this process. First, LLMs can translate natural-language objectives into structured reward templates or executable reward code. For example, Yu et al. [125] map language instructions to reward specifications for robotic skill synthesis, providing a practical route to turn high-level intent into trainable reward signals. Building on this, LLMs can improve credit assignment by converting episodic feedback into more informative intermediate signals. Qu et al. [87] propose LaRe, which leverages LLMs to restructure episodic rewards into intermediate latent rewards, thereby enhancing credit assignment and bridging the gap between linguistic priors and symbolic reward requirements.
Furthermore, LLMs can generate and iteratively revise reward code through an evaluate–refine loop. Ma et al. [69] demonstrate that LLMs can produce reward programs for continuous-control tasks, with iterative refinement yielding reward functions that outperform expert-designed baselines across diverse robotic environments. To extend these capabilities, LLM-based reward design can be strengthened by incorporating richer feedback modalities (e.g., demonstrations and vision-language cues) to mitigate the limitations of text-only specifications. Chen et al. [21] introduce an interactive framework that combines demonstrations with language/vision-language modeling to align reward features and weights more reliably for robotics.
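The evaluate–refine pattern can be sketched as follows, in the spirit of [69]; `llm_generate` is a hypothetical stand-in for a call to a code-generating model, and `train_and_score` is a stubbed training run, so the loop structure rather than the stubs is the point of the example.

```python
# A minimal sketch of an evaluate-refine loop for LLM-generated reward code:
# candidate reward functions are materialized, scored, and fed back as text.
def llm_generate(task_description, feedback):
    # Placeholder: in practice, prompt an LLM with the task and the last
    # round's training statistics; here we return a fixed candidate.
    return "def reward(state, action):\n    return -abs(state - 5.0)"

def train_and_score(reward_fn):
    # Placeholder: short RL training run under `reward_fn`, returning the
    # task's ground-truth success metric (not the shaped reward itself).
    return reward_fn(5.0, 0)

best_score, feedback = float("-inf"), ""
for round_ in range(3):
    code = llm_generate("move the arm to position 5", feedback)
    namespace = {}
    exec(code, namespace)                 # materialize the reward function
    score = train_and_score(namespace["reward"])
    feedback = f"round {round_}: success metric {score:.3f}"
    if score > best_score:
        best_score = score
print(best_score)
```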
Finally, beyond shaping and templating, reward functions can be discovered as an optimization objective. Lu et al. [68] introduce a bilevel framework that searches for reward functions for embodied RL agents via regret minimization, thereby reducing dependence on handcrafted rewards. These advancements in reward automation set the stage for broader LLM applications in RL algorithm evolution.
7.2 LLM for RL Algorithm Evolution
Building on the automation of reward functions, a core objective of AutoRL is to automate the learning process itself, extending beyond what is learned to how learning occurs. LLMs can facilitate RL algorithm evolution at two levels: discovering update rules and generating training recipes or configurations for outer-loop optimizers. A significant advancement in automated RL algorithm design involves searching directly in the space of learning rules. Oh et al. [79] show that machines can discover state-of-the-art RL update rules that surpass manually designed counterparts across challenging benchmarks, establishing a foundation for algorithm evolution beyond hyper-parameter tuning. In addition, even with a fixed base algorithm family, AutoRL requires the effective selection of learning-rate schedules, regularization, normalization, replay/batching details, and other recipe-level elements. This need is reinforced by empirical evidence that RL performance can be highly sensitive to hyper-parameter choices and tuning protocols. Eimer et al. [33] systematically analyze this sensitivity and recommend principled HPO practices for reproducible RL. Within such limited-budget search scenarios, LLMs can serve as proposal models that shape the search space and warm-start outer-loop optimizers. Yang et al. [124] explore LLMs as optimizers through prompting, while Zhang et al. [128] study LLM-guided decisions in hyper-parameter optimization under constrained evaluation budgets. In AutoRL, these LLM proposals can restrict the search to plausible regions and propose better initial configurations, while final selections remain grounded in empirical evaluation.
Overall, this sub-area reframes AutoRL as a human-knowledge-conditioned search over RL algorithms and recipes, where LLMs provide priors and structure, and the AutoRL loop ensures ground-truth validation. This algorithmic flexibility complements LLM-driven efforts in automating MDP components.
7.3 LLM for MDP Automation
Complementing the evolution of RL algorithms, LLMs can automate the construction of MDP components by translating raw problem contexts into RL-related abstractions. In complex domains with structured logs or mixed discrete-continuous contexts, LLMs can summarize global system states into representations more suitable for RL. Wang et al. [113] propose LESR, where an LLM generates task-relevant state representation code that incorporates domain priors, thereby improving sample efficiency and downstream policy learning. Moreover, many real-world RL problems demand adherence to validity constraints and operational rules. LLMs can express these as validators or action schemas (e.g., parameterized actions or hierarchical options), enabling AutoRL to search over action abstractions while enforcing feasibility at execution time. Additionally, when interactions involve APIs, simulators, or tool protocols, LLMs can generate wrappers to normalize reset, step, termination, and observation formatting, enhancing the portability of AutoRL pipelines across domains.
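As an illustration of such interface wrappers, the sketch below normalizes a hypothetical legacy simulator (`LegacySimulator`, invented here) behind a Gym-style `reset()`/`step()` interface; the observation flattening and termination rule are assumptions chosen for the example.

```python
# A minimal sketch of the kind of wrapper an LLM could generate to normalize
# reset, step, termination, and observation formatting for an RL pipeline.
class LegacySimulator:
    """Hypothetical backend with a non-standard API."""
    def start(self):
        return {"pos": 0.0}
    def advance(self, u):
        return {"pos": u}, u * 0.1, False   # raw obs, reward, failure flag

class NormalizedEnv:
    """Expose a Gym-style reset()/step() around the legacy simulator."""
    def __init__(self, sim):
        self.sim = sim

    def reset(self):
        raw = self.sim.start()
        return [raw["pos"]]                 # flat observation vector

    def step(self, action):
        raw, reward, failed = self.sim.advance(float(action))
        terminated = failed or abs(raw["pos"]) > 10.0
        return [raw["pos"]], reward, terminated, {}

env = NormalizedEnv(LegacySimulator())
obs = env.reset()
obs, reward, terminated, info = env.step(1.0)
```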
7.4 LLM as Policy Learner
Extending from MDP automation, LLMs can function as policy components, often by integrating language-level planning and memory with RL-style improvement mechanisms. A foundational direction is to use pre-trained language models as general-purpose decision-making backbones, adapting them to interactive environments. Li et al. [65] study how pre-trained language models can be used for interactive decision-making, providing an early blueprint for language-model-based policies in sequential environments. Building on this, Zhang et al. [127] introduce Rememberer, which equips an LLM with long-term experience memory and an RL mechanism for updating it, enabling continual improvement without fine-tuning model parameters. Zhao et al. [129] present ExpeL, where an agent collects experiences, extracts reusable natural-language insights, and applies them as self-generated guidance during inference. Wang et al. [114] propose Voyager, which builds an expanding library of executable skills through an iterative feedback loop, facilitating long-horizon behavior via compositional skill reuse.
Moreover, language priors can be injected into RL more explicitly to improve sample efficiency. Yan et al. [123] treat LLM outputs as action priors and integrate them into RL through Bayesian-style inference, showing that prior knowledge from LLMs can reduce exploration burden and accelerate learning. At the intersection of reward and policy learning, Du et al. [32] show that LLM-suggested goals can guide exploration and pretraining for RL agents, injecting language priors into the process without relying on dense hand-crafted rewards.
7.5 Challenges in LLM Integration
While LLMs offer powerful tools for automating reward design, evolving RL algorithms, constructing MDPs, and enhancing policy learning, their integration into AutoRL pipelines introduces several reliability challenges. Generated elements—such as reward functions, algorithm configurations, MDP abstractions, or policy guidance—may suffer from inconsistencies, incompleteness, or overfitting to specific benchmarks, potentially undermining reproducibility and generalizability across diverse tasks. The outer-loop optimization process becomes particularly vulnerable to variations in prompt engineering and context retrieval, where minor alterations in instructions or supporting evidence can lead to divergent algorithmic choices or suboptimal proposals. Moreover, seamlessly incorporating LLM suggestions into RL workflows demands rigorous verification mechanisms, safety protocols, and resource management strategies, as erroneous outputs could squander computational budgets, yield policies that exploit unintended reward loopholes, or breach operational constraints, thereby offsetting the efficiency gains promised by automation.
8 Limitations and Future Work
Existing AutoRL approaches face several limitations, as highlighted in the literature. A primary limitation is the lack of comparable and reproducible evaluation: deep RL results are highly sensitive to evaluation protocols, which makes fair cross-paper comparison difficult [50]. This variability also increases the risk that outer-loop search overfits to a specific benchmark setup rather than producing robust improvements [50, 33]. Current AutoRL benchmarks are often limited to simulation suites such as OpenAI Gym or the DeepMind Control Suite, which feature low-dimensional, dense-reward tasks; methods tuned on these benchmarks may struggle to generalize to complex, sparse-reward, or partially observable environments. AutoRL can be computationally demanding because it typically requires many full training runs, and the cost grows quickly with longer horizons, high-dimensional observations, or complex action spaces [33]. Scalability is therefore a severe challenge for high-dimensional tasks or long-horizon problems.
On the other hand, AutoRL introduces an additional optimization layer on top of standard RL, which increases computational cost. In a conventional RL setup, the training complexity depends on the number of training episodes, the episode length, and the cost of forward and backward passes through the policy and value networks. If $K$ configurations are explored, in the simplest case, where all candidates are fully trained, the overall computational cost scales approximately linearly with $K$. The dominant cost typically arises from repeated RL training. More efficient strategies, such as Bayesian optimization, evolutionary methods with parallel evaluation, and early stopping, can significantly reduce practical overhead and wall-clock time. Although AutoRL increases total computational requirements, it can improve robustness and reduce repeated manual tuning, making the additional cost justifiable in many complex applications.
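Under the simplifying assumption above that every candidate is fully trained, the dominant cost can be summarized as

$$\text{Cost}_{\text{AutoRL}} \approx K \cdot E \cdot T \cdot c,$$

where $E$ is the number of training episodes per run, $T$ is the average episode length, and $c$ is the per-step cost of the forward and backward passes through the policy and value networks; early stopping and multi-fidelity strategies reduce the effective $E$ and $T$ for unpromising candidates.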
Moreover, RL performance can vary substantially across random seeds and minor implementation choices, so AutoRL may select configurations that appear strong largely by chance without rigorous multi-seed controls and statistically meaningful reporting [50]. Finally, many AutoRL methods are validated mainly in simulated settings and may overlook real-world constraints such as safety, feasibility, delayed feedback, and operational costs, which are central in deployment [2, 89]. In addition, automation layers introduced for representation or training can bring additional hyper-parameters, which may shift manual effort rather than eliminate it if they are not automated end-to-end [5, 54].
These limitations suggest that progress in AutoRL should be measured not only by peak performance, but also by comparability, efficiency, robustness, and deployment readiness. A first direction is standardized AutoRL benchmarks and protocols that target the automation problem itself. Such benchmarks should fix compute budgets, require multi-seed reporting with uncertainty, and emphasize generalization so that different AutoRL methods can be compared under matched resources [50, 25, 46]. A second direction is budget-aware scalability. Practical AutoRL should reduce the number of full training runs via multi-fidelity optimization, principled early stopping, and reuse of information across trials, which is especially important when search spaces become large [64, 33]. A third direction is robustness under noise: AutoRL objectives and selection rules should account for variance to avoid selecting configurations that merely appear good by chance, improving reproducibility [50, 33]. Beyond these, real-world deployment calls for constraint-aware AutoRL pipelines that integrate safety and feasibility directly into MDP modeling and configuration search, building on safe RL formulations and benchmarks [2, 89]. Deployment also requires bridging simulation-to-reality gaps in embodied settings, where domain shift motivates robustness-oriented evaluation and techniques such as domain randomization [109]. Finally, LLM-assisted AutoRL is a promising direction for reducing manual design effort by generating candidate reward templates, abstractions, and training recipes as proposals that are then validated within the AutoRL outer loop [69]. Since LLM outputs can be inconsistent or biased, future systems should include verification mechanisms such as constraint checks and self-validation, and rely on empirical evaluation to prevent optimizing spurious signals [87, 69].
9 Ethical Considerations or Potential Risks
AutoRL can reduce the need for expert intervention and speed up RL development. However, greater automation also raises important ethical and practical concerns that should not be overlooked. One central issue is reward misspecification. When reward functions are generated or tuned automatically, the system may optimize for signals that do not properly reflect the true objective. As a result, the agent can learn behaviors that technically maximize reward but contradict the intended goal. In safety-critical domains such as robotics, healthcare, or autonomous systems, such misalignment may lead to unsafe or undesirable outcomes. Another concern relates to bias in learned representations. Automated state abstraction methods determine which features of the environment are emphasized or ignored. If the training data or optimization process contains biases, these may be embedded in the learned representation, potentially leading to unfair or systematically skewed decisions. Because these representations are often complex and difficult to interpret, identifying and correcting such issues can be challenging. Automation may also encourage unsafe exploration, particularly in real-world environments. For example, evolutionary strategies or curiosity-driven methods can push agents to explore aggressively. Without appropriate constraints, this exploration may result in harmful actions during training. In addition, as more components of RL systems become automated, human oversight may decrease. Automatically generated reward functions or representations can make system behavior harder to interpret and audit. This reduced transparency can complicate accountability, especially in regulated or high-stakes applications.
For these reasons, automating RL should be accompanied by safeguards such as human-in-the-loop validation, safety constraints, fairness assessments, and systematic testing under diverse scenarios. While automation offers clear advantages, responsible deployment requires careful consideration of its broader impacts.
10 Conclusion
In this paper, we have presented recent work in automated reinforcement learning that can be incorporated into an automated RL or DRL pipeline. We also introduced a general AutoRL pipeline suitable for solving sequential decision-making problems. This field is gaining increasing popularity, as a robust and high-quality reinforcement learning pipeline can address many complex tasks while significantly reducing required time and resources. The RL framework is divided into three main components in this paper, with relevant work discussed for each: MDP modeling, algorithm selection, and hyper-parameter optimization. In addition, learning-to-learn methods and neural network architecture automation are covered in separate sections. Through our exploration of the AutoRL literature, we conclude that a concrete and complete pipeline for AutoRL, analogous to those in AutoML, has yet to be fully developed, despite its substantial benefits for designing and solving sequential decision-making problems. Furthermore, AutoRL remains a relatively nascent research area, drawing growing attention. Key research questions include optimizing hyper-parameters with minimal resources, automatically modeling problems as MDPs, and generalizing mappings from available information to RL environments. The integration of large language models (LLMs) into AutoRL pipelines represents a promising direction, as evidenced by recent advancements in LLM-assisted reward design, algorithm evolution, and policy learning, which could further enhance automation and efficiency in complex domains.
References
- [1]M. Abdoos, N. Mozayani, and A. L. Bazzan (2014)Hierarchical control of traffic signals using q-learning with tile coding. Applied intelligence 40 (2), pp.201–213. Cited by: §2.1.1.
- [2]J. Achiam, D. Held, A. Tamar, and P. Abbeel (2017)Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, Vol. 70. External Links: LinkCited by: §8, §8.
- [3]R. R. Afshar, J. Rhuggenaath, Y. Zhang, and U. Kaymak (2021)A reward shaping approach for reserve price optimization using deep reinforcement learning. In 2021 International Joint Conference on Neural Networks (IJCNN), pp.1–8. Cited by: §2.3.3.
- [4]R. R. Afshar, J. Rhuggenaath, Y. Zhang, and U. Kaymak (2023)An automated deep reinforcement learning pipeline for dynamic pricing. IEEE Transactions on Artificial Intelligence 4 (3), pp.428–437. Cited by: §2.1.1.
- [5]R. R. Afshar, Y. Zhang, M. Firat, and U. Kaymak (2020)A state aggregation approach for solving knapsack problem with deep reinforcement learning. In Asian Conference on Machine Learning, pp.81–96. Cited by: §2.1.2, §8.
- [6]T. M. Alabi, N. P. Lawrence, L. Lu, Z. Yang, and R. B. Gopaluni (2023)Automated deep reinforcement learning for real-time scheduling strategy of multi-energy system integrated with post-carbon and direct-air carbon captured system. Applied Energy 333, pp.120633. Cited by: §1.
- [7]N. Altuntaş, E. Imal, N. Emanet, and C. N. Öztürk (2016)Reinforcement learning-based mobile robot navigation. Turkish Journal of Electrical Engineering & Computer Sciences 24 (3), pp.1747–1767. Cited by: §2.1.1.
- [8]M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas (2016)Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, pp.3981–3989. Cited by: §5.
- [9]M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba (2017)Hindsight experience replay. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp.5055–5065. Cited by: §4.1.4.
- [10]B. Baker, O. Gupta, N. Naik, and R. Raskar (2016)Designing neural network architectures using reinforcement learning. In ICLR (Poster), Cited by: §6.
- [11]J. C. Barsce, J. A. Palombarini, and E. C. Martínez (2017)Towards autonomous reinforcement learning: automatic setting of hyper-parameters using bayesian optimization. In 2017 XLIII Latin American Computer Conference (CLEI), pp.1–9. Cited by: §4.1.2, §4.1.7.
- [12]M. Baumann and H. K. Buning (2011)State aggregation by growing neural gas for reinforcement learning in continuous state spaces. In 2011 10th International Conference on Machine Learning and Applications and Workshops, Vol. 1, pp.430–435. Cited by: §2.1.2.
- [13]M. Beeks, R. R. Afshar, Y. Zhang, R. Dijkman, C. Van Dorst, and S. De Looijer (2022)Deep reinforcement learning for a multi-objective online order batching problem. In Proceedings of the International Conference on Automated Planning and Scheduling, Vol. 32, pp.435–443. Cited by: §4.1.2.
- [14]I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio (2016)Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940. Cited by: §2.1.1.
- [15]A. Biedenkapp, H. F. Bozkurt, T. Eimer, F. Hutter, and M. Lindauer (2020)Dynamic algorithm configuration: foundation of a new meta-algorithmic framework. In Proceedings of the Twenty-fourth European Conference on Artificial Intelligence (ECAI’20)(Jun 2020), Cited by: §4.1.6.
- [16]B. Bischl, M. Binder, M. Lang, T. Pielok, J. Richter, S. Coors, J. Thomas, T. Ullmann, M. Becker, A. Boulesteix, et al. (2023)Hyperparameter optimization: foundations, algorithms, best practices, and open challenges. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 13 (2), pp.e1484. Cited by: §4.
- [17]C. Bodnar, B. Day, and P. Lió (2020)Proximal distilled evolutionary reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp.3283–3290. Cited by: §2.3.2.
- [18]G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016)Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §2.2.
- [19]G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016)OpenAI gym. External Links: arXiv:1606.01540 Cited by: §2.3.3.
- [20]B. Chen, K. Zhang, L. Ou, C. Ba, H. Wang, and C. Wang (2020)Automatic hyper-parameter optimization based on mapping discovery from data to hyper-parameters. arXiv preprint arXiv:2003.01751. Cited by: §4.1.7.
- [21]L. Chen, N. M. Moorman, and M. C. Gombolay (2025)ELEMENTAL: interactive learning from demonstrations and vision-language models for reward design in robotics. In Proceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research. External Links: LinkCited by: §7.1.
- [22]Y. Chen, A. Huang, Z. Wang, I. Antonoglou, J. Schrittwieser, D. Silver, and N. de Freitas (2018)Bayesian optimization in alphago. arXiv preprint arXiv:1812.06855. Cited by: §4.1.2.
- [23]H. L. Chiang, A. Faust, M. Fiser, and A. Francis (2019)Learning navigation behaviors end-to-end with autorl. IEEE Robotics and Automation Letters 4 (2), pp.2007–2014. Cited by: §2.3.3.
- [24]P. Chou, D. Maturana, and S. Scherer (2017)Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution. In International conference on machine learning, pp.834–843. Cited by: §2.2.1.
- [25]K. Cobbe, C. Hesse, J. Hilton, and J. Schulman (2020)Leveraging procedural generation to benchmark reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, Vol. 119, pp.2048–2056. External Links: LinkCited by: §8.
- [26]W. Dabney, G. Ostrovski, D. Silver, and R. Munos (2018)Implicit quantile networks for distributional reinforcement learning. In International conference on machine learning, pp.1096–1105. Cited by: §2.2.1.
- [27]H. Dai, B. Dai, and L. Song (2016)Discriminative embeddings of latent variable models for structured data. In International conference on machine learning, pp.2702–2711. Cited by: §1, §2.1.1, §2.1.1.
- [28]C. Daniel, J. Taylor, and S. Nowozin (2016)Learning step size controllers for robust neural network training. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30. Cited by: §5.
- [29]H. Degroote, B. Bischl, L. Kotthoff, and P. De Causmaecker (2016)Reinforcement learning for automatic online algorithm selection-an empirical study. ITAT 2016 Proceedings 1649, pp.93–101. Cited by: §3.
- [30]M. Dennis, N. Jaques, E. Vinitsky, A. Bayen, S. Russell, A. Critch, and S. Levine (2020)Emergent complexity and zero-shot transfer via unsupervised environment design. Advances in neural information processing systems 33, pp.13049–13061. Cited by: §5.
- [31]X. Dong, J. Shen, W. Wang, Y. Liu, L. Shao, and F. Porikli (2018)Hyperparameter optimization for tracking with continuous deep q-learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.518–527. Cited by: §4.1.7.
- [32]Y. Du, O. Watkins, Z. Wang, C. Colas, T. Darrell, P. Abbeel, A. Gupta, and J. Andreas (2023)Guiding pretraining in reinforcement learning with large language models. In Proceedings of the 40th International Conference on Machine Learning (ICML), External Links: LinkCited by: §7.4.
- [33]T. Eimer, M. Lindauer, and R. Raileanu (2023)Hyperparameters in reinforcement learning and how to tune them. In International Conference on Machine Learning, Proceedings of Machine Learning Research. External Links: LinkCited by: §7.2, §8, §8.
- [34]A. Elthakeb, P. Pilligundla, F. Mireshghallah, A. Yazdanbakhsh, S. Gao, and H. Esmaeilzadeh (2019)Releq: an automatic reinforcement learning approach for deep quantization of neural networks. In NeurIPS ML for Systems workshop, 2018, Cited by: §6.
- [35]B. Eysenbach, S. Gu, J. Ibarz, and S. Levine (2018)Leave no trace: learning to reset for safe and autonomous reinforcement learning. In ICLR 2018 : International Conference on Learning Representations 2018, Cited by: §5.
- [36]B. Eysenbach, R. Salakhutdinov, and S. Levine (2019)Search on the replay buffer: bridging planning and reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 32, pp.15246–15257. Cited by: §2.3.2.
- [37]A. Faust, A. Francis, and D. Mehta (2019)Evolving rewards to automate reinforcement learning. arXiv preprint arXiv:1905.07628. Cited by: §2.3.3.
- [38]F. C. Fernandez and W. Caarls (2018)Parameters tuning and optimization for reinforcement learning algorithms using evolutionary computing. In 2018 International Conference on Information Systems and Computer Science (INCISCOS), pp.301–305. Cited by: §4.1.4.
- [39]C. Finn, P. Abbeel, and S. Levine (2017)Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp.1126–1135. Cited by: §2.3.3, §5.
- [40]C. Florensa, D. Held, M. Wulfmeier, M. Zhang, and P. Abbeel (2017)Reverse curriculum generation for reinforcement learning. Conference on Robot Learning, pp.482–495. Cited by: §2.3.1.
- [41]A. I. Forrester, A. Sóbester, and A. J. Keane (2007)Multi-fidelity optimization via surrogate modelling. Proceedings of the royal society a: mathematical, physical and engineering sciences 463 (2088), pp.3251–3269. Cited by: §4.1.7.
- [42]J. K. Franke, G. Köhler, A. Biedenkapp, and F. Hutter (2020)Sample-efficient automated deep reinforcement learning. arXiv preprint arXiv:2009.01555. Cited by: §4.1.4.
- [43]F. M. Garcia and P. S. Thomas (2019)A meta-mdp approach to exploration for lifelong reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 32, pp.5691–5700. Cited by: §6.
- [44]M. Grześ (2017)Reward shaping in episodic reinforcement learning. In AAMAS ’17 Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp.565–573. Cited by: §2.3.3.
- [45]S. Gu, T. Lillicrap, I. Sutskever, and S. Levine (2016)Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pp.2829–2838. Cited by: §4.1.7.
- [46]C. Gulcehre, Z. Wang, A. Novikov, et al. (2020)RL unplugged: a suite of benchmarks for offline reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), External Links: LinkCited by: §8.
- [47]X. Guo, O. Hernández-Lerma, X. Guo, and O. Hernández-Lerma (2009)Continuous-time markov decision processes. Springer. Cited by: §1.
- [48]T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al. (2018)Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905. Cited by: §2.3.3.
- [49]X. He, K. Zhao, and X. Chu (2021)AutoML: a survey of the state-of-the-art. Knowledge-Based Systems 212, pp.106622. External Links: ISSN 0950-7051, Document, LinkCited by: §1.
- [50]P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger (2018)Deep reinforcement learning that matters. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32. Cited by: §1, §8, §8, §8.
- [51]A. Hill, A. Raffin, M. Ernestus, A. Gleave, A. Kanervisto, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu (2018)Stable baselines. GitHub. Note: https://github.com/hill-a/stable-baselinesCited by: §2.2.1.
- [52]F. Hutter, H. H. Hoos, and K. Leyton-Brown (2011)Sequential model-based optimization for general algorithm configuration. In International conference on learning and intelligent optimization, pp.507–523. Cited by: §4.1.2.
- [53]K. Ilavarasi and K. S. Joseph (2014)Variants of travelling salesman problem: a survey. In International Conference on Information Communication and Embedded Systems (ICICES2014), pp.1–7. Cited by: §2.3.3.
- [54]B. Ivanovic, J. Harrison, A. Sharma, M. Chen, and M. Pavone (2019)Barc: backward reachability curriculum for robotic reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pp.15–21. Cited by: §2.3.1, §8.
- [55]K. Jamieson and A. Talwalkar (2016)Non-stochastic best arm identification and hyperparameter optimization. In Artificial Intelligence and Statistics, pp.240–248. Cited by: §4.1.3.
- [56]H. S. Jomaa, J. Grabocka, and L. Schmidt-Thieme (2019)Hyp-rl: hyperparameter optimization by reinforcement learning. arXiv preprint arXiv:1906.11527. Cited by: §4.1.6.
- [57]K. Judah, A. Fern, P. Tadepalli, and R. Goetschalckx (2014)Imitation learning with demonstrations and shaping rewards. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 28. Cited by: §2.3.3.
- [58]A. Kanervisto, C. Scheller, and V. Hautamäki (2020)Action space shaping in deep reinforcement learning. In 2020 IEEE Conference on Games (CoG), pp.479–486. Cited by: §2.2.2.
- [59]S. Khadka and K. Tumer (2018)Evolution-guided policy gradient in reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 31, pp.1188–1200. Cited by: §2.3.2.
- [60]E. Khalil, H. Dai, Y. Zhang, B. Dilkina, and L. Song (2017)Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pp.6348–6358. Cited by: §2.1.1, §2.1.1.
- [61]M. Kim, J. Kim, and J. Park (2023)Automated hyperparameter tuning in reinforcement learning for quadrupedal robot locomotion. Electronics 13 (1), pp.116. Cited by: §2.3.3.
- [62]L. Kirsch, S. van Steenkiste, and J. Schmidhuber (2019)Improving generalization in meta reinforcement learning using learned objectives. arXiv preprint arXiv:1910.04098. Cited by: §5.
- [63]R. Laroche and R. Feraud (2018)Reinforcement learning algorithm selection. In ICLR 2018 : International Conference on Learning Representations 2018, Cited by: §3.
- [64]L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar (2017)Hyperband: a novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research 18 (1), pp.6765–6816. Cited by: §4.1.3, §8.
- [65]S. Li, X. Puig, C. Paxton, Y. Du, C. Wang, L. Fan, T. Chen, D. Huang, E. Akyurek, A. Anandkumar, J. Andreas, I. Mordatch, A. Torralba, and Y. Zhu (2022)Pre-trained language models for interactive decision-making. In Advances in Neural Information Processing Systems, External Links: LinkCited by: §7.4.
- [66]T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016)Continuous control with deep reinforcement learning. In ICLR 2016 : International Conference on Learning Representations 2016, Cited by: §2.3.2.
- [67]S. Lin and R. Wright (2010)Evolutionary tile coding: an automated state abstraction algorithm for reinforcement learning. In Proceedings of the 8th AAAI Conference on Abstraction, Reformulation, and Approximation, pp.42–47. Cited by: §2.1.2.
- [68]R. Lu et al. (2025)Discovery of the reward function for embodied reinforcement learning agents. Nature Communications. External Links: Document, LinkCited by: §7.1.
- [69]Y. J. Ma, W. Liang, G. Wang, D. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar (2024)Eureka: human-level reward design via coding large language models. In International Conference on Learning Representations (ICLR), External Links: LinkCited by: §7.1, §8.
- [70] D. Maclaurin, D. Duvenaud, and R. Adams (2015). Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122.
- [71] N. Mazyavkina, S. Sviridov, S. Ivanov, and E. Burnaev (2020). Reinforcement learning for combinatorial optimization: a survey. arXiv preprint arXiv:2003.03600.
- [72] N. Mazyavkina, S. Sviridov, S. Ivanov, and E. Burnaev (2021). Reinforcement learning for combinatorial optimization: a survey. Computers & Operations Research 134, pp. 105400.
- [73] Y. Miao, X. Song, J. D. Co-Reyes, D. Peng, S. Yue, E. Brevdo, and A. Faust (2022). Differentiable architecture search for reinforcement learning. In International Conference on Automated Machine Learning, pp. 20–1.
- [74] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, et al. (2019). Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pp. 293–312.
- [75] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937.
- [76] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015). Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
- [77] A. Mohan, C. Benjamins, K. Wienecke, A. Dockhorn, and M. Lindauer (2023). AutoRL hyperparameter landscapes. arXiv preprint arXiv:2304.02396.
- [78] A. Y. Ng, D. Harada, and S. Russell (1999). Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Vol. 99, pp. 278–287.
- [79] J. Oh, G. Farquhar, I. Kemaev, D. A. Calian, M. Hessel, L. Zintgraf, S. Singh, H. van Hasselt, and D. Silver (2025). Discovering state-of-the-art reinforcement learning algorithms. Nature.
- [80] J. Oh, M. Hessel, W. M. Czarnecki, Z. Xu, H. P. van Hasselt, S. Singh, and D. Silver (2020). Discovering reinforcement learning algorithms. Advances in Neural Information Processing Systems 33, pp. 1060–1070.
- [81] A. L. Ottoni, E. G. Nepomuceno, M. S. de Oliveira, and D. C. de Oliveira (2020). Tuning of reinforcement learning parameters applied to SOP using the Scott–Knott method. Soft Computing 24 (6), pp. 4441–4453.
- [82] J. Parker-Holder, V. Nguyen, and S. J. Roberts (2020). Provably efficient online hyperparameter optimization with population-based bandits. Advances in Neural Information Processing Systems 33, pp. 17200–17211.
- [83] J. Parker-Holder, R. Rajan, X. Song, A. Biedenkapp, Y. Miao, T. Eimer, B. Zhang, V. Nguyen, R. Calandra, A. Faust, et al. (2022). Automated reinforcement learning (AutoRL): a survey and open problems. Journal of Artificial Intelligence Research 74, pp. 517–568.
- [84] E. Peer, V. Menkovski, Y. Zhang, and W. Lee (2018). Shunting trains with deep reinforcement learning. In 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 3063–3068.
- [85] J. Peters, K. Mulling, and Y. Altun (2010). Relative entropy policy search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 24.
- [86] M. L. Puterman (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
- [87] Y. Qu, Y. Jiang, B. Wang, Y. Mao, et al. (2025). Latent reward: LLM-empowered credit assignment in episodic reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
- [88] S. Raschka (2018). Model evaluation, model selection, and algorithm selection in machine learning. arXiv preprint arXiv:1811.12808.
- [89] A. Ray, J. Achiam, and D. Amodei (2019). Benchmarking safe exploration in deep reinforcement learning. OpenAI.
- [90] R. Refaei Afshar, Y. Zhang, M. Firat, and U. Kaymak (2019). A reinforcement learning method to select ad networks in waterfall strategy. In 11th International Conference on Agents and Artificial Intelligence, ICAART 2019, pp. 256–265.
- [91] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008). The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80.
- [92] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- [93] A. Sehgal, H. La, S. Louis, and H. Nguyen (2019). Deep reinforcement learning using genetic algorithm for parameter optimization. In 2019 Third IEEE International Conference on Robotic Computing (IRC), pp. 596–601.
- [94] A. A. Sherstov and P. Stone (2005). Function approximation via tile coding: automating parameter choice. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 194–205.
- [95] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489.
- [96] S. Singh, A. G. Barto, and N. Chentanez (2004). Intrinsically motivated reinforcement learning. In Proceedings of the 17th International Conference on Neural Information Processing Systems, pp. 1281–1288.
- [97] W. D. Smart and L. P. Kaelbling (2002). Effective reinforcement learning for mobile robots. In Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292), Vol. 4, pp. 3404–3410.
- [98] J. Sola and J. Sevilla (1997). Importance of input data normalization for the application of neural networks to complex industrial problems. IEEE Transactions on Nuclear Science 44 (3), pp. 1464–1468.
- [99] G. K. B. Souza and A. L. C. Ottoni (2024). AutoRL-Sim: automated reinforcement learning simulator for combinatorial optimization problems. Modelling 5 (3), pp. 1056–1083.
- [100] G. K. B. Souza, S. O. S. Santos, A. L. C. Ottoni, M. S. Oliveira, D. C. R. Oliveira, and E. G. Nepomuceno (2024). Transfer reinforcement learning for combinatorial optimization problems. Algorithms 17 (2), pp. 87.
- [101] M. T. Spaan (2012). Partially observable Markov decision processes. In Reinforcement Learning, pp. 387–414.
- [102] K. O. Stanley, J. Clune, J. Lehman, and R. Miikkulainen (2019). Designing neural networks through neuroevolution. Nature Machine Intelligence 1 (1), pp. 24–35.
- [103] P. Su, D. Vandyke, M. Gasic, N. Mrksic, T. Wen, and S. Young (2015). Reward shaping with recurrent neural networks for speeding up on-line policy learning in spoken dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 417–421.
- [104] R. S. Sutton and A. G. Barto (2018). Reinforcement learning: an introduction. MIT Press.
- [105] R. S. Sutton (1996). Generalization in reinforcement learning: successful examples using sparse coarse coding. Advances in Neural Information Processing Systems, pp. 1038–1044.
- [106] Y. Tang and S. Agrawal (2020). Discretizing continuous action space for on-policy optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 5981–5988.
- [107] A. Tavakoli, M. Fatemi, and P. Kormushev (2020). Learning to represent action values as a hypergraph on the action vertices. In International Conference on Learning Representations.
- [108] C. Tessler, G. Tennenholtz, and S. Mannor (2019). Distributional policy optimization: an alternative approach for continuous control. Advances in Neural Information Processing Systems 32, pp. 1352–1362.
- [109] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017). Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
- [110] T. Vincent, F. Wahren, J. Peters, B. Belousov, and C. D'Eramo (2024). Adaptive Q-network: on-the-fly target selection for deep reinforcement learning. arXiv preprint arXiv:2405.16195.
- [111] O. Vinyals, M. Fortunato, and N. Jaitly (2015). Pointer networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 2, pp. 2692–2700.
- [112] Q. Vuong, S. Vikram, H. Su, S. Gao, and H. I. Christensen (2019). How to pick the domain randomization parameters for sim-to-real transfer of reinforcement learning policies? arXiv preprint arXiv:1903.11774.
- [113] B. Wang, Y. Qu, Y. Jiang, J. Shao, C. Liu, W. Yang, and X. Ji (2024). LLM-empowered state representation for reinforcement learning. In Proceedings of the 41st International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, Vol. 235.
- [114] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2024). Voyager: an open-ended embodied agent with large language models. Transactions on Machine Learning Research (TMLR).
- [115] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick (2016). Learning to reinforcement learn. Cognitive Science.
- [116] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas (2016). Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224.
- [117] M. White and A. White (2016). A greedy approach to adapting the trace parameter for temporal difference learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 557–565.
- [118] S. Whiteson (2010). Adaptive tile coding. In Adaptive Representations for Reinforcement Learning, pp. 65–76.
- [119] M. Wiering and M. Van Otterlo (2012). Reinforcement learning. Adaptation, Learning, and Optimization 12, pp. 3.
- [120] C. Xu, T. Qin, G. Wang, and T. Liu (2017). Reinforcement learning for learning rate control. arXiv preprint arXiv:1705.11159.
- [121] Z. Xu, H. P. van Hasselt, M. Hessel, J. Oh, S. Singh, and D. Silver (2020). Meta-gradient reinforcement learning with an objective discovered online. Advances in Neural Information Processing Systems 33, pp. 15254–15264.
- [122] Z. Xu, H. P. van Hasselt, and D. Silver (2018). Meta-gradient reinforcement learning. Advances in Neural Information Processing Systems 31.
- [123] X. Yan, Y. Song, X. Feng, M. Yang, H. Zhang, H. Bou Ammar, and J. Wang (2025). Efficient reinforcement learning with large language model priors. In International Conference on Learning Representations.
- [124] C. Yang et al. (2024). Large language models as optimizers. In International Conference on Learning Representations (ICLR).
- [125] W. Yu et al. (2023). Language to rewards for robotic skill synthesis. In Proceedings of the 7th Conference on Robot Learning, Proceedings of Machine Learning Research.
- [126] T. Zahavy, Z. Xu, V. Veeriah, M. Hessel, J. Oh, H. P. van Hasselt, D. Silver, and S. Singh (2020). A self-tuning actor-critic algorithm. Advances in Neural Information Processing Systems 33, pp. 20913–20924.
- [127] D. Zhang, L. Chen, S. Zhang, H. Xu, Z. Zhao, and K. Yu (2023). Large language models are semi-parametric reinforcement learning agents. In Advances in Neural Information Processing Systems (NeurIPS).
- [128] M. R. Zhang, N. Desai, J. Bae, J. Lorraine, and J. Ba (2024). Using large language models for hyperparameter optimization. Transactions on Machine Learning Research (TMLR).
- [129] A. Zhao, D. Huang, Q. Xu, M. Lin, Y. Liu, and G. Huang (2024). ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
- [130] B. Zoph and Q. V. Le (2016). Neural architecture search with reinforcement learning. In ICLR.
- [131] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018). Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710.
- [132] H. Zou, T. Ren, D. Yan, H. Su, and J. Zhu (2019). Reward shaping via meta-learning. arXiv preprint arXiv:1901.09330.