
etd AT Indian Institute of Science >
Division of Electrical Sciences >
Computer Science and Automation (csa) >
Please use this identifier to cite or link to this item:
http://hdl.handle.net/2005/812

Title:  Simulation Based Algorithms For Markov Decision Process And Stochastic Optimization 
Authors:  Abdulla, Mohammed Shahid 
Advisors:  Bhatnagar, Shalabh 
Keywords:  Markov Processes  Data Processing Algorithms Simulation Markov Decision Processes (MDPs) Infinite Horizon Markov Decision Processes Finite Horizon Markov Decision Processes Stochastic Approximation  Algorithms Simultaneous Perturbation Stochastic Approximation (SPSA) Network FlowControl FHMDP Algorithms Stochastic Optimization Reinforcement Learning Algorithms 
Submitted Date:  May2008 
Series/Report no.:  G22213 
Abstract:  In Chapter 2, we propose several twotimescale simulationbased actorcritic algorithms for solution of infinite horizon Markov Decision Processes (MDPs) with finite statespace under the average cost criterion. On the slower timescale, all the algorithms perform a gradient search over corresponding policy spaces using two different Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates. On the faster timescale, the differential cost function corresponding to a given stationary policy is updated and averaged for enhanced performance. A proof of convergence to a locally optimal policy is presented. Next, a memory efficient implementation using a featurevector representation of the statespace and TD (0) learning along the faster timescale is discussed. A threetimescale simulation based algorithm for solution of infinite horizon discountedcost MDPs via the Value Iteration approach is also proposed. An approximation of the Dynamic Programming operator T is applied to the value function iterates. A sketch of convergence explaining the dynamics of the algorithm using associated ODEs is presented. Numerical experiments on rate based flow control on a bottleneck node using a continuoustime queueing model are presented using the proposed algorithms.
Next, in Chapter 3, we develop three simulationbased algorithms for finitehorizon MDPs (FHMDPs). The first algorithm is developed for finite state and compact action spaces while the other two are for finite state and finite action spaces. Convergence analysis is briefly sketched. We then concentrate on methods to mitigate the curse of dimensionality that affects FHMDPs severely, as there is one probability transition matrix per stage. Two parametrized actorcritic algorithms for FHMDPs with compact action sets are proposed, the ‘critic’ in both algorithms learning the policy gradient. We show w.p1convergence to a set with the necessary condition for constrained optima. Further, a third algorithm for stochastic control of stopping time processes is presented. Numerical experiments with the proposed finitehorizon algorithms are shown for a problem of flow control in communication networks.
Towards stochastic optimization, in Chapter 4, we propose five algorithms which are variants of SPSA. The original one measurement SPSA uses an estimate of the gradient of objective function L containing an additional bias term not seen in twomeasurement SPSA. We propose a onemeasurement algorithm that eliminates this bias, and has asymptotic convergence properties making for easier comparison with the twomeasurement SPSA. The algorithm, under certain conditions, outperforms both forms of SPSA with the only overhead being the storage of a single measurement. We also propose a similar algorithm that uses perturbations obtained from normalized Hadamard matrices. The convergence w.p.1 of both algorithms is established. We extend measurement reuse to design three secondorder SPSA algorithms, sketch the convergence analysis and present simulation results on an illustrative minimization problem. We then propose several stochastic approximation implementations for related algorithms in flowcontrol of communication networks, beginning with a discretetime implementation of Kelly’s primal flowcontrol algorithm. Convergence with probability1 is shown, even in the presence of communication delays and stochastic effects seen in link congestion indications. Two relevant enhancements are then pursued :a) an implementation of the primal algorithm using secondorder information, and b) an implementation where edgerouters rectify misbehaving flows. Also, discretetime implementations of Kelly’s dual algorithm and primaldual algorithm are proposed. Simulation results a) verifying the proposed algorithms and, b) comparing stability properties with an algorithm in the literature are presented. 
URI:  http://etd.iisc.ernet.in/handle/2005/812 
Appears in Collections:  Computer Science and Automation (csa)

Items in etd@IISc are protected by copyright, with all rights reserved, unless otherwise indicated.
