Closing the gap for online lossy source coding

In Chapter 7, we provided a sequential lossy source coding scheme that achieves a normalized distortion redundancy of $O\bigl(\sqrt{\ln(T)/T}\bigr)$ relative to any finite reference class of limited-delay, limited-memory codes, improving on the earlier $O(T^{-1/3})$ results. Applied to the case when the reference class is the (infinite) set of scalar quantizers, we showed that the algorithm achieves $O(\ln(T)/\sqrt{T})$ normalized distortion redundancy, which is almost optimal given that the normalized distortion redundancy is known to be at least of order $1/\sqrt{T}$. The existence of a coding scheme with optimal high-confidence performance guarantees depends on the existence of an online prediction algorithm with high-confidence guarantees on both its regret and its number of switches. As discussed in the previous section, this remains an open problem.
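To put these rates in context, the following is a minimal illustration rather than the Chapter 7 scheme itself: the standard exponentially weighted average forecaster, run over a finite class of $N$ experts with convex losses in $[0,1]$ and learning rate $\eta = \sqrt{8\ln(N)/T}$, has cumulative regret at most $\sqrt{(T/2)\ln N}$, so its normalized (per-round) regret satisfies
% standard exponentially weighted forecaster bound; not the Chapter 7 scheme
\[
\frac{R_T}{T} \;=\; \frac{1}{T}\Biggl(\,\sum_{t=1}^{T} \ell_t(\hat{p}_t) \;-\; \min_{1\le i\le N}\sum_{t=1}^{T} \ell_t(i)\Biggr)
\;\le\; \sqrt{\frac{\ln N}{2T}} \;=\; O\Bigl(\sqrt{\ln(N)/T}\Bigr).
\]
This bound applies to stateless experts whose losses are fully observed in every round; the difficulty addressed in Chapter 7 is that limited-delay, limited-memory reference codes carry internal state, so obtaining the same order of rate requires a more involved scheme.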
