Adaptive proportional fair parameterization based LTE scheduling using continuous actor-critic reinforcement learning

Maintaining a desired trade-off between system throughput maximization and user fairness satisfaction constitutes a problem that is still far from being solved. In LTE systems, different tradeoff levels can be obtained by using a proper parameterization of the Generalized Proportional Fair (GPF) scheduling rule. Our approach is able to find the best parameterization policy that maximizes the system throughput under different fairness constraints imposed by the scheduler state. The proposed method adapts and refines the policy at each Transmission Time Interval (TTI) by using the Multi-Layer Perceptron Neural Network (MLPNN) as a non-linear function approximation between the continuous scheduler state and the optimal GPF parameter(s). The MLPNN function generalization is trained based on Continuous Actor-Critic Learning Automata Reinforcement Learning (CACLA RL). The double GPF parameterization optimization problem is addressed by using CACLA RL with two continuous actions (CACLA-2). Five reinforcement learning algorithms that use the simple, single-parameter GPF parameterization are compared against the novel technique. Simulation results indicate that CACLA-2 performs much better than any of the other candidates that adjust only one scheduling parameter, such as CACLA-1. CACLA-2 outperforms CACLA-1 by reducing the percentage of TTIs in which the system is considered unfair. Being able to attenuate the fluctuations of the obtained policy, CACLA-2 achieves an enhanced throughput gain when severe changes occur in the scheduling environment, while maintaining, at the same time, the fairness optimality condition.


INTRODUCTION
The optimal allocation of channel and rate resources under a given set of Quality of Service (QoS) requirements constitutes an important throughput maximization task of the scheduling procedure. In particular, fairness-guaranteed scheduling becomes a complex problem to solve, since multiple active users are connected to the base station through fast-fading radio channels and LTE schedulers are designed in an opportunistic manner intended to exploit the multiuser diversity. Hence, by using simple scheduling rules, near-Pareto-optimal user throughputs should be obtained under a given fairness performance requirement among multiple users. The fairness target selection and the way of applying the best scheduling rules in order to satisfy the considered requirement become the main concerns in designing a self-learning LTE scheduler. The Channel Quality Indicator (CQI) feedback, as achievable user rate information, should be considered in the fairness performance evaluation metric in order to avoid the unfair treatment of users with unfavorable channel conditions. For this study, the Next Generation Mobile Networks (NGMN) fairness requirement is considered as the fairness criterion, in such a way that the system is considered fair if and only if, at each TTI t, at least (100-x)% of the active users achieve at least x% of the normalized user throughput [1]. This criterion can be achieved by using a satisfactory parameterization of the GPF scheduling rule [2]. The objective function of the current study is designed to maximize the system throughput under the NGMN fairness constraints. With a given input state at each TTI, the scheduler should be able to find the best policy of GPF parameter sets to be applied in the current TTI in order to meet the grand NGMN objective.
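As an illustration, the NGMN criterion stated above can be checked directly in the CDF domain: the system is fair if and only if the empirical CDF of the normalized user throughputs stays at or below the oblique requirement line F(t) = t. The following is a minimal sketch; the sampling grid and the strict-inequality convention are implementation assumptions, not part of the NGMN specification.

```python
import numpy as np

def ngmn_fair(norm_tp, grid=None):
    """Check the NGMN fairness requirement in the CDF domain.

    The system is fair iff, for every x, at least (100 - x)% of users
    achieve at least x% of the normalized throughput, i.e. the empirical
    CDF F of the normalized user throughputs satisfies F(t) <= t.
    """
    norm_tp = np.asarray(norm_tp, dtype=float)
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)   # assumed sampling grid
    cdf = np.array([(norm_tp < t).mean() for t in grid])
    return bool(np.all(cdf <= grid))
```

For example, perfectly equal normalized throughputs trivially satisfy the requirement, whereas a population in which most users obtain only a small fraction of the normalized throughput violates it.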
The continuous and multidimensional scheduler state space is modeled by using the Markov Decision Process (MDP) framework, in which the selected CACLA-2 actions are rewarded based on the transition performance from the previous to the current state. Based on the given MDP problems, CACLA-2 criticizes each action set in order to localize the optimal scheduler state much faster [3]. The experiments show that CACLA-2 performs much better in comparison with the other RL algorithms by maximizing the mean user throughput and minimizing, at the same time, the percentage of TTIs in which the system stays unfair. The rest of this paper is organized as follows: Section II discusses the relevant techniques proposed in the literature; Section III presents the optimization problem; Section IV presents the architecture of the novel self-learning scheduler; Section V shows the results; and the paper concludes with Section VI.

II. RELATED WORK
The idea of applying RL principles to the LTE scheduler state space generalization constrained by multiple QoS objectives was originally proposed in [4], [5]. In particular, the Q-Learning algorithm with the MLPNN function approximation is used to achieve different static tradeoff levels between system throughput and user fairness [6]. The packet scheduling optimization problem in terms of a static Jain Fairness Index (JFI) constraint is analyzed in [7]. However, imposing the fairness limit regardless of the channel conditions makes the approach impractical for real-time schedulers. For this reason, qualitative fairness measures based on channel statistics, rather than quantitative fairness thresholds, are preferred in practice. The NGMN qualitative fairness measure adaptation techniques in LTE systems were first elaborated in [8], in which the cumulative distribution function (CDF) of the normalized user throughputs is adjusted by using a simple parameterization of the GPF scheduling metric. The CDF curve adaptation to the fairness requirement is achieved only once every 1 s, leading to a waste of system capacity, especially when the traffic load varies drastically TTI-by-TTI.

Fig. 1. Fairness evaluation criteria (benchmarks) for a 60-user scenario equally distributed from the eNodeB base station to the cell edge, under uniform power allocation and FDD downlink transmission with a system bandwidth of 20 MHz: a) Jain Fairness Index vs. mean user throughput; b) CDF distribution.
A slightly improved version of the method proposed in [7] is introduced in [9], in which the JFI constraint is replaced by the NGMN fairness requirement in the CDF domain (the continuous oblique black line in Fig. 1.b).
A set of RL algorithms that are able to match at each TTI the CDF curve under the NGMN fairness constraint (Fig. 1.b) is proposed in [9]. The CACLA-1 actor-critic algorithm outperforms the methods proposed in [7], [8] and the other RL algorithms by maximizing the percentage of TTIs in which the system respects the NGMN fairness requirement. It is important to notice that all the methods illustrated above use a simple, single-parameter parameterization of the GPF scheduling rule. The method proposed in this paper uses a double parameterization of the GPF scheduling discipline. It is shown that, by using CACLA-2 with two continuous GPF parameters, the obtained policy converges much better to the optimal scheduler state when compared with CACLA-1.

III. GPF OPTIMIZATION PROBLEM
The GPF scheduling metric proposed in [2] exploits two parameters, α_t and β_t, in order to obtain near-optimal user throughputs and to adjust the fairness performance in such a way that the NGMN requirement is accomplished. The system model considers a set of preselected users with an infinite-buffer traffic model and a minimum requested bit rate of 0 kbps. At each TTI t, a set of orthogonal sub-carrier groups called Resource Blocks (RBs) [10] is shared among the active users by solving the GPF integer linear programming optimization problem, subject to a convex set of constraints, as shown by Eq. 1. The GPF metric takes the form of the achievable user rate raised to the power α_t divided by the average user throughput raised to the power β_t, so that when α_t = β_t = 1 the well-known proportional fair (PF) metric is obtained. The illustrative mean user throughput and JFI fairness tradeoffs for particular GPF rules are highlighted in Fig. 1.a. The simple GPF parameterization (GPF-1) used by the adjusting policies in [6], [7], [8] and [9] represents the special case in which only one parameter is adapted. The proposed GPF-2 scheme adjusts both parameters TTI-by-TTI in order to reach the optimal or feasible scheduler state (green tradeoff values in Fig. 1.a). The set of adjustable GPF parameters constitutes the MLPNN output space or, equivalently, the RL action space. For CACLA-2, the action space is the continuous two-dimensional domain of (α_t, β_t) pairs; for the other RL algorithms with discrete action spaces exposed and analyzed in this paper, the action applied at TTI t for the scheduling procedure is selected from a finite set of parameter values.
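The per-RB allocation induced by the GPF rule can be sketched as follows, assuming the standard generalized proportional fair form r^α / T̄^β; the exact notation of Eq. 1 in [2] may differ, so this is an illustrative sketch rather than the paper's precise formulation.

```python
import numpy as np

def gpf_allocate(rates, avg_tp, alpha=1.0, beta=1.0):
    """Allocate each RB to the user maximizing the GPF metric
    r_{i,j}^alpha / T_i^beta (assumed standard generalized PF form).

    rates  : (n_users, n_rbs) achievable rates from CQI feedback
    avg_tp : (n_users,) past average user throughputs
    Returns the winning user index for each RB.
    """
    rates = np.asarray(rates, dtype=float)
    avg_tp = np.asarray(avg_tp, dtype=float)
    metric = rates ** alpha / (avg_tp[:, None] ** beta)
    return metric.argmax(axis=0)
```

With alpha=1 and beta=1 the rule reduces to PF: a user with a large past throughput is de-prioritized even if its instantaneous rate is high, which is exactly the fairness-throughput tradeoff the parameterization controls.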
The role of the RL approaches is to drive the scheduler into the feasible state region (FEA), the collection of multi-dimensional data points for which the scheduler meets the NGMN requirement under different channel and network conditions. When the applied action moves the operating point in the throughput-JFI domain to the left of the FEA zone, the scheduler is declared unfair (the MT scheduling rule case) and the state belongs to the unfair region (UF); otherwise, the scheduler is considered over-fair (the MF scheduling rule case) and the state belongs to the over-fair region (OF). By translating the quantitative tradeoff evaluation (Fig. 1.a) into the qualitative NGMN fairness evaluation (Fig. 1.b), the scheduler state status is decided based on the NGMN Objective Function (NOF). Considering the CDF of the normalized user throughput observations and the NGMN fairness requirement (the continuous oblique black line), the aggregated NOF is calculated based on Eq. 3. Based on the NGMN specifications [1], the scheduler state is fair when the CDF respects the requirement line. The delimitation between the feasible area and the over-fairness area is given by the superior CDF limit (the oblique dotted black line in Fig. 1.b), obtained by shifting the requirement with a confidence parameter that guarantees the feasible-region detection during the exploration period (Eq. 4). For a larger confidence parameter, the FEA region can be detected much faster at the price of degrading the system throughput, whereas when the confidence parameter is small enough, more exploration time is required for CACLA-2 to localize the feasible state. The scheduler state status is decided based on Eq. 5. The purpose of CACLA-2 is to find and maintain the feasible state.

IV. THE SELF-LEARNING LTE-A SCHEDULER ARCHITECTURE
The proposed actor-critic RL algorithm learns the optimal policy of GPF parameter actions based on the interaction between the conventional LTE scheduler and the novel controller. At each TTI t, the controller receives a new state from the scheduler and has to decide, based on the RL approach, which action should be applied at the current TTI (Fig. 2). In other words, the controller has to learn how to behave for many MDP problems. In this sense, the controller requires the state value function and the action-state value function. When RL approaches with discrete actions are used, a separate action-state value is requested for each discrete action. The idea is to upgrade these values over a very large number of iterations by using the temporal-difference learning principles [11]. The goal of the training stage is to find the state-action mapping that provides the maximum rewards averaged over the number of training epochs or training TTIs; once trained, the value functions are directly applied to each new state and the exploitation stage is performed. In order to reduce the MLPNN structural complexity and to speed up the convergence toward the feasible region, the initial scheduler state space has to be aggregated by using a special preprocessing stage.

A. LTE Controller State Space
Due to the continuous and high-dimensional characteristics of the original scheduler state space, the controller works with a compact, aggregated state representation. This representation contains: the mean and the standard deviation of the log-normal distribution fitted to the NUT (normalized user throughput) observations; the minimum/maximum difference R_t^d between the obtained and the required percentiles when the system is fair/unfair, as indicated in [9]; and a system status flag indicating whether the NGMN fairness requirement is met at the current TTI.
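A possible realization of this preprocessing stage is sketched below. The exact feature set used in [9] is not fully recoverable from the text, so the log-normal fit, the signed CDF distance and the flag encoding are assumptions made for illustration.

```python
import numpy as np

def controller_state(norm_tp, grid=None):
    """Compress raw NUT observations into a compact feature vector:
    [mu, sigma, dist, unfair_flag], where mu and sigma are the
    log-normal location/scale of the NUT samples, dist is a signed
    distance to the NGMN line (max violation if unfair, else the
    closest approach), and unfair_flag marks NGMN violation."""
    nut = np.asarray(norm_tp, dtype=float)
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)  # assumed sampling grid
    mu = np.log(nut).mean()                  # log-normal location
    sigma = np.log(nut).std()                # log-normal scale
    cdf = np.array([(nut < t).mean() for t in grid])
    d = cdf - grid                           # signed distance to F(t) = t
    unfair = d.max() > 0.0
    dist = d.max() if unfair else d.min()
    return np.array([mu, sigma, dist, float(unfair)])
```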

B. Reward Function
The objective function from Eq. 3 is not suitable as a reward function due to the oscillatory characteristics of the LTE scheduling procedure. Therefore, the particularities of GPF-2 should be highlighted in the reward function computation. Compared with CACLA-1 [9], CACLA-2 reveals the existence of multiple optimal solutions for the same scheduler state: the feasibility can be reached for multiple optimal sets of (α_t, β_t) parameters, while pushing the parameters too far drives the system into the over-fair region. Consequently, the reward function for GPF-2 is divided into cases, as indicated in Eq. 7: when the state is unfair, the reward punishes the applied action proportionally to the NGMN violation (Eq. 7.a), whereas for the over-fairness case the reward function follows the opposite direction of Eq. 7.a. During the training stage, the role of the CACLA-2 RL algorithm is to collect the maximum rewards TTI-by-TTI and to select, for each MDP problem, those actions that improve and refine the final policy of GPF-2 parameters.
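The piecewise character of Eq. 7 can be illustrated with a deliberately simplified reward; the coefficients and shaping terms below are assumptions, not the paper's exact expressions.

```python
def gpf2_reward(state, throughput_gain, dist):
    """Illustrative piecewise reward in the spirit of Eq. 7.

    state           : 'UF', 'FEA' or 'OF' (Eq. 5 decision)
    throughput_gain : normalized throughput objective in the feasible case
    dist            : NGMN CDF distance R_t^d
    The 0.5 weighting of the over-fair branch is an assumption.
    """
    if state == "UF":            # punish proportionally to the violation
        return -abs(dist)
    if state == "OF":            # opposite direction: push back toward FEA
        return -0.5 * abs(dist)
    return throughput_gain       # feasible: reward the throughput objective
```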

C. CACLA-2 RL Algorithm
Based on the CACLA principles, two MLPNN functions are used to approximate the state and the action-state values, with the forwarded values computed from the MLPNN weights trained up to TTI t. The MLPNN weights are updated according to the gradient-descent principle, which aims to minimize the mean-square errors E_t^A and E_t^V between the forwarded values and their target values, based on Eq. 8.a and Eq. 8.b, respectively, where λ_A and λ_V are the corresponding learning rates. The errors from (8.a) and (8.b) are back-propagated layer-by-layer to each MLPNN node.
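A one-hidden-layer sketch of such an MLPNN with gradient-descent weight updates follows; the layer sizes, the tanh activation and the learning rate are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

class MLP:
    """One-hidden-layer perceptron: a sketch of the MLPNN function
    approximator with squared-error backpropagation (Eq. 8 style)."""
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_out, n_hidden))
        self.b2 = np.zeros(n_out)

    def forward(self, s):
        self.s = s
        self.h = np.tanh(self.W1 @ s + self.b1)   # hidden activations
        return self.W2 @ self.h + self.b2

    def backward(self, err, lr=0.01):
        # gradient-descent step on the squared error 0.5*||output - target||^2,
        # with err = output - target, back-propagated layer by layer
        dh = (self.W2.T @ err) * (1.0 - self.h ** 2)
        self.W2 -= lr * np.outer(err, self.h)
        self.b2 -= lr * err
        self.W1 -= lr * np.outer(dh, self.s)
        self.b1 -= lr * dh
```

Repeatedly forwarding a state and back-propagating the error against a fixed target value drives the output toward that target, which is exactly how the critic tracks its target values TTI-by-TTI.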
The target state value is updated at TTI t, when the reward value is received as a result of applying the previous action, according to Eq. 9: the target is the bootstrapped one-step return, i.e. the received reward plus the discounted state value of the new state, where the discount factor indicates the importance of future scheduler rewards.
In order to find the best policy of GPF parameters, the training procedure uses a combination of exploration and exploitation stages which permits the selection of greedy actions with a given probability: a two-dimensional random number is compared against the greedy threshold Th, which decides between policy evaluation (10.a) and policy improvement (10.b). Equation 10 constitutes the actor scheme of the CACLA-2 algorithm. The critic scheme updates the target action value of the visited state according to Eq. 11.
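The combined actor-critic update of Eqs. 9-11 can be sketched with linear approximators standing in for the MLPNNs; the defining CACLA rule, retained here, is that the actor is pulled toward the explored action only when the temporal-difference error is positive. The linear features and learning rates are assumptions made for compactness.

```python
import numpy as np

def cacla2_step(theta_v, theta_a, phi_s, phi_s2, a_taken, reward,
                gamma=0.95, lr_v=0.05, lr_a=0.05):
    """One CACLA-2 update with linear function approximators.

    theta_v : (n_feat,) critic weights, V(s) = theta_v . phi_s
    theta_a : (2, n_feat) actor weights mapping state features to the
              two continuous GPF parameters
    a_taken : (2,) explored action actually applied at this TTI
    Returns the updated weights and the TD error.
    """
    v_s, v_s2 = theta_v @ phi_s, theta_v @ phi_s2
    target = reward + gamma * v_s2          # Eq. 9 style TD target
    delta = target - v_s                    # temporal-difference error
    theta_v = theta_v + lr_v * delta * phi_s
    if delta > 0:                           # CACLA: reinforce only good actions
        a_pred = theta_a @ phi_s
        theta_a = theta_a + lr_a * np.outer(a_taken - a_pred, phi_s)
    return theta_v, theta_a, delta
```

A positive TD error moves the actor toward the explored (α_t, β_t) pair; a negative one leaves the actor untouched, which is what keeps the CACLA policy oscillations bounded.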

D. Other RL Candidates
The performance of the actor-critic schemes is compared against actor-only RL schemes that are well known in the specialty literature. Double Q-Learning [11] is a modified version of standard Q-learning that uses a double estimator, in the sense that two agents store two action-value functions. QV-2 learning works by keeping track of both action and state values and differs from the original QV-learning in the MLPNN error functions [12]. The QVMAX and QVMAX2 algorithms are off-line RL procedures in the sense that they combine the state and action-state value updates and the error function computation of the QV-2 and Q-learning approaches [13]. These RL candidates use discrete action sets and adjust only the GPF-1 parameterization.
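For comparison, the double-estimator idea behind Double Q-Learning [11] can be sketched in tabular form; the paper combines it with MLPNN approximation, so the dictionary-based version below is an illustrative simplification.

```python
import random

def double_q_update(Qa, Qb, s, a, r, s2, actions, alpha=0.1, gamma=0.95):
    """Tabular Double Q-learning update: one randomly chosen estimator
    selects the greedy next action while the other evaluates it, which
    removes the overestimation bias of standard Q-learning.

    Qa, Qb : dicts mapping (state, action) to value estimates
    """
    if random.random() < 0.5:
        Qa, Qb = Qb, Qa                 # update the other estimator
    a_star = max(actions, key=lambda x: Qa.get((s2, x), 0.0))
    target = r + gamma * Qb.get((s2, a_star), 0.0)
    q = Qa.get((s, a), 0.0)
    Qa[(s, a)] = q + alpha * (target - q)
```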

V. SIMULATION RESULTS
In order to prove the eligibility of the CACLA-2 actor-critic RL algorithm in comparison with the other methods, the considered scenario randomly varies the number of active users every 1 s within the interval [10, 120], and the user mobility follows a random-walk model with a speed of 30 km/h. The evaluation of each scheduling algorithm is achieved by using the same conditions for interference, path loss, shadowing and fast fading. Each base station transmits with the same power, which is equally distributed over all RBs. The best-effort traffic type is considered for the downlink transmission. The CQI feedback sent in the uplink direction is considered to be errorless. The rest of the LTE scheduler parameters can be found in TABLE I. The LTE controller trains each RL algorithm for 1000 s; then, the resulting policy is exploited for 200 s. Due to the fact that the Double-Q-learning algorithm spends too much time in irrelevant state-space regions, after the exploration period the controller is decoupled from the LTE scheduler and retrained on the visited MDP problems for a duration of about 500 s. Except for the CACLA-1, CACLA-2 and Double-Q algorithms, the other candidates use the Boltzmann policy for the action selection procedure [13]. The rest of the LTE controller parameters can be found in TABLE II.
As shown in Fig. 3, the CACLA-2, CACLA-1, QVMAX, QVMAX2 and Double-Q RL algorithms satisfy the NGMN fairness requirement when the number of active users remains constant. From the CDF perspective, the PF scheduling metric and the QV-2 RL algorithm localize the scheduler outside the feasible region. The evolution of the learned policies over time is depicted in Fig. 4 as a function of the number of bearers in the active state. The QV-2, Double-Q and QVMAX learning algorithms provide the highest fluctuations of the learned parameter, with the slowest policy recovery toward the optimal value. On the other hand, CACLA-1 and CACLA-2 exploit the advantage of the critic scheme by keeping the policy oscillations within acceptable limits. By using two continuous actions, CACLA-2 is able to recover the feasible state much faster when the traffic load is varying: by increasing one GPF parameter and decreasing the other when the number of users increases from 12 to 115 (Fig. 4), the policy stabilizes in less than 10 ms. A significant system throughput gain can thus be achieved while minimizing, at the same time, the percentage of TTIs in which the scheduler is located outside the feasible operating area.
When the minimum/maximum NGMN distance R_t^d is considered (Fig. 5), CACLA-2 outperforms the main candidate, CACLA-1, by maintaining the system in the minimum distance range of [0, 0.03]. The impact of the policy fluctuations of the other candidates is directly visible in Fig. 5, where their NGMN distances converge much more slowly than those of the actor-critic schemes. The PF scheduling rule indicates the highest NGMN distance for the whole transmission period. These aspects explain the higher throughput gain of the actor-critic schemes in Fig. 6 when compared against the other candidates. In particular, CACLA-2 indicates a throughput gain of about 0.2 Mbps when compared with CACLA-1. The PF metric shows the worst performance even when the JFI-mean user throughput tradeoff is considered, leading to a waste of system capacity when the NGMN fairness requirement is imposed. Regarding the percentage of TTIs in which the system state is unfair, feasible or over-fair (Fig. 7) in the exploitation stage, the static parameterization of the PF scheduling rule shows the highest amount of TTIs in which the scheduler is over-fair and the lowest percentage of TTIs in which the system is feasible. When the simple parameterization is used, QV2-learning constitutes the worst choice from the unfair-state perspective, whereas, from the viewpoint of the number of TTIs in which the scheduler is over-fair, the best performance is obtained by using the QVMAX policy. The CACLA-1 algorithm outperforms the other candidates with the simple parameterization scheme when the feasible region is considered. When the proposed GPF-2 parameterization is used, CACLA-2 outperforms all of the other RL methods: it gains more than 3000 additional feasible TTIs and around 6% more feasible TTIs when compared with the main candidate, the CACLA-1 algorithm. From another perspective, by increasing the number of feasible TTIs, the number of reward punishments in the exploitation stage is strongly reduced. This highlights the quality of the proposed policy and its ability to recover the desired feasible state in less than 10 TTIs when severe changes in the traffic load and user channel conditions appear.

VI. CONCLUSIONS
The current work shows that the use of the double GPF parameterization increases the percentage of TTIs in which the scheduler is feasible by 6% in comparison with the simple parameterization technique. The percentages of TTIs in which the system is considered unfair or over-fair indicate a real improvement of about 1.32% and 4.71%, respectively, when CACLA-2 is used. By using the double action space, the resulting policy exhibits lower fluctuations when the traffic load drastically changes. In conclusion, the double parameterization of the GPF scheduling rule brings real improvements in terms of system throughput gains and percentage of TTIs in which the system is considered feasible.

Fig. 2. The architecture of the self-learning LTE-A scheduler.

Fig. 3. The CDF curves of the proposed policies.

Fig. 7. Percentage of TTIs in which the scheduler sits in the UF/FEA/OF state regions.

TABLE I. LTE SCHEDULER PARAMETERS

TABLE II. LTE CONTROLLER PARAMETERS