Q-learning is a model-free reinforcement learning algorithm. The goal of Q-learning is to learn a policy that tells an agent what action to take under what circumstances. It does not require a model of the environment (hence "model-free"), and it can handle problems with stochastic transitions and rewards without requiring adaptations.

For any finite Markov decision process (FMDP), Q-learning finds a policy that is optimal in the sense that it maximizes the expected value of the total reward over any and all successive steps, starting from the current state. Q-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time and a partly random policy. "Q" denotes the function that returns the reward used to provide the reinforcement, and can be said to stand for the "quality" of an action taken in a given state.

Reinforcement learning

Main article: Reinforcement learning

Reinforcement learning involves an agent, a set of states S, and a set A of actions per state. By performing an action a ∈ A, the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score). The goal of the agent is to maximize its total future reward. It does this by adding the maximum reward attainable from future states to the reward for reaching its current state, effectively influencing the current action by the potential future reward. This potential reward is a weighted sum of the expected values of the rewards of all future steps starting from the current state.

As an example, consider the process of boarding a train, in which the reward is measured by the negative of the total time spent boarding (equivalently, the cost of boarding the train equals the boarding time). One strategy is to enter the train door as soon as it opens, minimizing the initial wait time for yourself. If the train is crowded, however, you will have a slow entry after the initial action of entering the door, as people fight you to depart the train as you attempt to board. The total boarding time, or cost, is then: 0 seconds of wait time + 15 seconds of fight time. The next day, by random chance (exploration), you decide to wait and let other people depart first. This initially results in a longer wait time. However, less time is spent fighting the departing passengers. Overall, this path has a higher reward than that of the previous day, since the total boarding time is now: 5 seconds of wait time + 0 seconds of fight time. Through exploration, despite the initial (patient) action resulting in a larger cost (or negative reward) than the forceful strategy, the overall cost is lower, thus revealing a more rewarding strategy.

Algorithm

[Figure: a Q-learning table of states by actions, initialized to zero; each cell is then updated through training.]

The weight for a step from a state Δt steps into the future is calculated as γ^Δt, where γ (the discount factor) is a number between 0 and 1 (0 ≤ γ ≤ 1) that has the effect of valuing rewards received earlier higher than those received later (reflecting the value of a "good start"). γ may also be interpreted as the probability of succeeding (or surviving) at every step Δt.

The algorithm, therefore, has a function that calculates the quality of a state-action combination:

Q : S × A → ℝ

Before learning begins, Q is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time t, the agent selects an action a_t, observes a reward r_t, and enters a new state s_{t+1} (which may depend on both the previous state s_t and the selected action), and Q is updated. The core of the algorithm is a Bellman equation as a simple value-iteration update, using the weighted average of the old value and the new information:

Q^{new}(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \cdot \left( r_t + \gamma \cdot \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right)

where r_t is the reward received when moving from state s_t to state s_{t+1}, and α is the learning rate (0 < α ≤ 1). Here Q(s_t, a_t) is the old value, r_t + γ · max_a Q(s_{t+1}, a) is the temporal-difference target (with max_a Q(s_{t+1}, a) the estimate of the optimal future value), and the whole parenthesized quantity is the temporal difference.

An episode of the algorithm ends when the state s_{t+1} is a final or terminal state. However, Q-learning can also learn in non-episodic tasks.[citation needed] If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops.

For all final states s_f, Q(s_f, a) is never updated, but is set to the reward value r observed for state s_f. In most cases, Q(s_f, a) can be taken to equal zero.
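The update rule above translates almost directly into code. The following is a minimal illustrative sketch of tabular Q-learning with an ε-greedy behaviour policy; the Gym-style environment interface (env.reset, env.step) and all hyperparameter values are assumptions made for the example, not part of any particular library.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = np.zeros((n_states, n_actions))          # table initialized to zero
    for _ in range(episodes):
        s = env.reset()                          # start a new episode
        done = False
        while not done:
            # Partly random policy: explore with probability epsilon
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)        # observe reward and next state
            # Bellman update: move Q(s, a) toward the temporal-difference target;
            # at a terminal state the target is just the observed reward r
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```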
Influence of variables

Learning rate

The learning rate, or step size, determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities). In fully deterministic environments, a learning rate of α_t(s, a) = 1 is optimal. When the problem is stochastic, the algorithm still converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, a constant learning rate is often used, such as α_t = 0.1 for all t.

Discount factor

The discount factor γ determines the importance of future rewards. A factor of 0 will make the agent "myopic" (short-sighted), considering only current rewards, i.e. r_t in the update rule above, while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For γ = 1, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite. Even with a discount factor only slightly lower than 1, Q-function learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network. In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning.

Initial conditions (Q0)

Since Q-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions", can encourage exploration: no matter which action is selected, the update rule will cause it to have lower values than the alternatives, thus increasing their selection probability. The first reward r can be used to reset the initial conditions. According to this idea, the first time an action is taken, the reward is used to set the value of Q. This allows immediate learning in the case of fixed deterministic rewards. A model that incorporates reset of initial conditions (RIC) is expected to predict participants' behaviour better than a model that assumes any arbitrary initial condition (AIC). RIC appears to be consistent with human behaviour in repeated binary choice experiments.
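As a small illustration of these variables, the fragment below sets up optimistic initial Q-values, a constant learning rate, and a discount factor that is annealed upward during training. All concrete numbers and the linear schedule are illustrative assumptions, not prescribed values.

```python
import numpy as np

n_states, n_actions = 16, 4
n_episodes = 500

# Optimistic initial conditions: every action looks promising until tried,
# so the update rule pushes tried actions down and encourages exploration.
Q = np.full((n_states, n_actions), 10.0)   # 10.0 is an arbitrary high value

alpha = 0.1            # constant learning rate, a common practical choice

def gamma_schedule(episode):
    """Anneal the discount factor from a low start toward its final value,
    which can speed up learning when Q is approximated by a neural network."""
    gamma_start, gamma_final = 0.5, 0.99   # illustrative endpoints
    frac = min(1.0, episode / (n_episodes - 1))
    return gamma_start + frac * (gamma_final - gamma_start)
```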
Implementation

Q-learning at its simplest stores data in tables. This approach falters with increasing numbers of states/actions, since the likelihood of the agent visiting a particular state and performing a particular action becomes increasingly small.

Function approximation

Q-learning can be combined with function approximation. This makes it possible to apply the algorithm to larger problems, even when the state space is continuous. One solution is to use an (adapted) artificial neural network as a function approximator. Function approximation may speed up learning in finite problems, due to the fact that the algorithm can generalize earlier experiences to previously unseen states.

Quantization

Another technique to decrease the state/action space quantizes possible values. Consider the example of learning to balance a stick on a finger. Describing the state at a certain point in time involves the position of the finger in space, its velocity, the angle of the stick, and the angular velocity of the stick. This yields a four-element vector that describes one state, i.e. a snapshot of one state encoded into four values. The problem is that there are infinitely many possible states. To shrink the possible space of valid actions, many values can be assigned to one bucket. The exact distance of the finger from its starting position (from negative infinity to infinity) is then not known, but rather only whether it is far away or not (Near, Far).
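A minimal sketch of such bucketing for the stick-balancing example is given below. The bucket boundaries, the number of buckets, and the function names are hypothetical choices for illustration; real ranges would depend on the task.

```python
import numpy as np

# One pair of bucket boundaries per dimension keeps the example tiny: each
# continuous value is reduced to one of three coarse labels (e.g. Near / Far).
DISTANCE_BINS = np.array([-0.5, 0.5])   # finger distance from start position
VELOCITY_BINS = np.array([-1.0, 1.0])   # finger velocity
ANGLE_BINS    = np.array([-0.2, 0.2])   # stick angle (radians)
ANG_VEL_BINS  = np.array([-2.0, 2.0])   # stick angular velocity

def discretize(distance, velocity, angle, angular_velocity):
    """Map the continuous four-element state vector to a single table index."""
    idx = (np.digitize(distance, DISTANCE_BINS),
           np.digitize(velocity, VELOCITY_BINS),
           np.digitize(angle, ANGLE_BINS),
           np.digitize(angular_velocity, ANG_VEL_BINS))
    # 3 buckets per dimension -> 3**4 = 81 discrete states instead of infinity
    return int(np.ravel_multi_index(idx, (3, 3, 3, 3)))
```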
History

Q-learning was introduced by Chris Watkins in 1989. A convergence proof was presented by Watkins and Dayan in 1992. Watkins was addressing "Learning from delayed rewards", the title of his PhD thesis. Eight years earlier, in 1981, the same problem, under the name "Delayed reinforcement learning", had been solved by Bozinovski's Crossbar Adaptive Array (CAA). The memory matrix W = ‖w(a, s)‖ was the same as the Q-table of Q-learning eight years later. The architecture introduced the term "state evaluation" into reinforcement learning. The crossbar learning algorithm, written in mathematical pseudocode in the paper, performs the following computation in each iteration:

1. In state s perform action a;
2. Receive the consequence state s';
3. Compute the state evaluation v(s');
4. Update the crossbar value: w'(a, s) = w(a, s) + v(s').

The term "secondary reinforcement" is borrowed from animal learning theory to model state values via backpropagation: the state value v(s') of the consequence situation is backpropagated to the previously encountered situations. CAA computes state values vertically and actions horizontally (the "crossbar"). Demonstration graphs showing delayed reinforcement learning contained states (desirable, undesirable, and neutral), which were computed by the state evaluation function. This learning system was a forerunner of the Q-learning algorithm. In 2014, Google DeepMind patented an application of Q-learning to deep learning, titled "deep reinforcement learning" or "deep Q-learning" (DQN), that can play Atari 2600 games at expert human levels.

Variants

Deep Q-learning

The DeepMind system used a deep convolutional neural network, with layers of tiled convolutional filters to mimic the effects of receptive fields. Reinforcement learning is unstable or divergent when a nonlinear function approximator such as a neural network is used to represent Q. This instability comes from the correlations present in the sequence of observations, from the fact that small updates to Q may significantly change the policy and the data distribution, and from the correlations between Q and the target values. The technique used experience replay, a biologically inspired mechanism that, instead of the most recent action, uses a random sample of prior actions to proceed. This removes correlations in the observation sequence and smooths changes in the data distribution. Iterative updates adjust Q towards target values that are only periodically updated, further reducing correlations with the target.
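The replay mechanism described above can be sketched as a simple buffer of transitions sampled uniformly at random. This is an illustrative sketch of the general idea, not DeepMind's actual implementation; the class name, capacity, and batch size are assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions; sampling uniformly at random
    breaks the correlations present in the sequence of observations."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall out

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # A random sample of prior transitions, not just the most recent one
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```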
Double Q-learning

Because the future maximum approximated action value in Q-learning is evaluated using the same Q function as in the current action-selection policy, in noisy environments Q-learning can sometimes overestimate the action values, slowing the learning. A variant called Double Q-learning was proposed to correct this. Double Q-learning is an off-policy reinforcement learning algorithm in which a different policy is used for value evaluation than the one used to select the next action.

In practice, two separate value functions, Q^A and Q^B, are trained in a mutually symmetric fashion using separate experiences. The double Q-learning update step is then as follows:

Q^{A}_{t+1}(s_t, a_t) = Q^{A}_{t}(s_t, a_t) + \alpha_t(s_t, a_t) \left( r_t + \gamma \, Q^{B}_{t}\!\left(s_{t+1}, \operatorname*{arg\,max}_{a} Q^{A}_{t}(s_{t+1}, a)\right) - Q^{A}_{t}(s_t, a_t) \right)

and

Q^{B}_{t+1}(s_t, a_t) = Q^{B}_{t}(s_t, a_t) + \alpha_t(s_t, a_t) \left( r_t + \gamma \, Q^{A}_{t}\!\left(s_{t+1}, \operatorname*{arg\,max}_{a} Q^{B}_{t}(s_{t+1}, a)\right) - Q^{B}_{t}(s_t, a_t) \right).

Now the estimated value of the discounted future is evaluated using a different policy, which solves the overestimation issue. This algorithm was later modified[clarification needed] in 2015 and combined with deep learning, as in the DQN algorithm, resulting in Double DQN, which outperforms the original DQN algorithm.
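A minimal sketch of one such update step is shown below, assuming two NumPy tables and a toy interface; behaviour actions would typically be chosen from the two tables combined (e.g. greedily with respect to Q_A + Q_B), which is omitted here.

```python
import numpy as np

def double_q_update(Q_A, Q_B, s, a, r, s_next,
                    alpha=0.1, gamma=0.9, done=False):
    """One double Q-learning step over two tables (hypothetical toy interface).

    With probability 1/2 the roles of the tables are swapped, so each table is
    updated on roughly half of the experiences, mirroring the two update
    equations above: one table picks the arg-max action, the other values it.
    """
    if np.random.rand() < 0.5:
        Q_A, Q_B = Q_B, Q_A                     # swap roles (local rebinding)
    a_star = int(np.argmax(Q_A[s_next]))        # selection by one table ...
    target = r if done else r + gamma * Q_B[s_next, a_star]  # ... valued by the other
    Q_A[s, a] += alpha * (target - Q_A[s, a])   # update the selecting table in place
```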
Others

Delayed Q-learning is an alternative implementation of the online Q-learning algorithm, with probably approximately correct (PAC) learning. Greedy GQ is a variant of Q-learning for use in combination with (linear) function approximation. The advantage of Greedy GQ is that convergence is guaranteed even when function approximation is used to estimate the action values.

See also

Temporal difference learning
State-action-reward-state-action (SARSA)
Iterated prisoner's dilemma
Game theory

References

- Melo, Francisco S. "Convergence of Q-learning: a simple proof". Archived from the original on 18 November 2017; retrieved 23 February 2020.
- Matiisen, Tambet (19 December 2015). "Demystifying Deep Reinforcement Learning". Computational Neuroscience Lab, neuro.cs.ut.ee. Archived from the original on 7 April 2018; retrieved 6 April 2018.
- Sutton, Richard; Barto, Andrew (1998). Reinforcement Learning: An Introduction. MIT Press. Archived from the original on 20 February 2020; retrieved 4 March 2020.
- Russell, Stuart J.; Norvig, Peter (2010). Artificial Intelligence: A Modern Approach (Third ed.). Prentice Hall. p. 649. ISBN 978-0136042594.
- Baird, Leemon (1995). "Residual algorithms: Reinforcement learning with function approximation" (PDF). ICML: 30-37.
- François-Lavet, Vincent; Fonteneau, Raphael; Ernst, Damien (7 December 2015). "How to Discount Deep Reinforcement Learning: Towards New Dynamic Strategies". arXiv:1512.02011 [cs.LG].
- Sutton, Richard S.; Barto, Andrew G. "2.7 Optimistic Initial Values". Reinforcement Learning: An Introduction. Archived from the original on 8 September 2013; retrieved 18 July 2013.
- Shteingart, Hanan; Neiman, Tal; Loewenstein, Yonatan (May 2013). "The role of first impression in operant learning" (PDF). Journal of Experimental Psychology: General. 142 (2): 476-488. doi:10.1037/a0029550. ISSN 1939-2222. PMID 22924882. Archived from the original (PDF) on 26 January 2021; retrieved 25 February 2020.
- Hasselt, Hado van (5 March 2012). "Reinforcement Learning in Continuous State and Action Spaces". In Wiering, Marco; Otterlo, Martijn van (eds.). Reinforcement Learning: State-of-the-Art. Springer Science & Business Media. pp. 207-251. ISBN 978-3-642-27645-3.
- Tesauro, Gerald (March 1995). "Temporal Difference Learning and TD-Gammon". Communications of the ACM. 38 (3): 58-68. doi:10.1145/203330.203343. Archived from the original on 9 February 2010; retrieved 8 February 2010.
- Watkins, C. J. C. H. (1989). Learning from Delayed Rewards (PDF) (PhD thesis). Cambridge University. Archived from the original (PDF) on 9 September 2016; retrieved 4 March 2020.
- Watkins, C. J. C. H.; Dayan, P. (1992). "Q-learning". Machine Learning.
- Bozinovski, S. (15 July 1999). "Crossbar Adaptive Array: The first connectionist network that solved the delayed reinforcement learning problem". In Dobnikar, Andrej; Steele, Nigel C.; Pearson, David W.; Albrecht, Rudolf F. (eds.). Artificial Neural Nets and Genetic Algorithms: Proceedings of the International Conference in Portorož, Slovenia, 1999. Springer Science & Business Media. pp. 320-325. ISBN 978-3-211-83364-3.
- Bozinovski, S. (1982). "A self-learning system using secondary reinforcement". In Trappl, Robert (ed.). Cybernetics and Systems Research: Proceedings of the Sixth European Meeting on Cybernetics and Systems Research. North-Holland. pp. 397-402. ISBN 978-0-444-86488-8.
- Barto, A. (24 February 1997). "Reinforcement learning". In Omidvar, Omid; Elliott, David L. (eds.). Neural Systems for Control. Elsevier. ISBN 978-0-08-053739-9.
- "Methods and apparatus for reinforcement learning" (PDF). US Patent Office. 9 April 2015. Archived from the original (PDF) on 29 July 2018; retrieved 28 July 2018.
- Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A.; Veness, Joel; Bellemare, Marc G.; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K.; et al. (February 2015). "Human-level control through deep reinforcement learning". Nature. 518 (7540): 529-533. doi:10.1038/nature14236. ISSN 0028-0836. PMID 25719670.
- van Hasselt, Hado (2011). "Double Q-learning" (PDF). Advances in Neural Information Processing Systems. 23: 2613-2622. Archived from the original (PDF) on 26 March 2020; retrieved 4 March 2020.
- van Hasselt, Hado; Guez, Arthur; Silver, David (2015). "Deep Reinforcement Learning with Double Q-learning" (PDF). AAAI Conference on Artificial Intelligence: 2094-2100. Archived from the original (PDF) on 6 February 2020; retrieved 4 March 2020.
- Strehl, Alexander L.; Li, Lihong; Wiewiora, Eric; Langford, John; Littman, Michael L. (2006). "PAC model-free reinforcement learning" (PDF). Proc. 22nd ICML: 881-888. Archived from the original (PDF) on 14 April 2021; retrieved 4 March 2020.
- Maei, Hamid; Szepesvári, Csaba; Bhatnagar, Shalabh; Sutton, Richard (2010). "Toward off-policy learning control with function approximation" (PDF). pp. 719-726. Archived from the original (PDF) on 8 September 2012; retrieved 25 January 2016.

External links

- Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England. (Archived 14 September 2018 at the Wayback Machine)
- Strehl, Li, Wiewiora, Langford, Littman (2006). PAC model-free reinforcement learning.
- Reinforcement Learning: An Introduction, by Richard Sutton and Andrew S. Barto; an online textbook.
- Piqle: a Generic Java Platform for Reinforcement Learning. (Archived 6 January 2019 at the Wayback Machine)
- Reinforcement Learning Maze (Archived 27 September 2011 at the Wayback Machine), a demonstration of guiding an ant through a maze using Q-learning.
- Q-learning work by Gerald Tesauro. (Archived 4 June 2011 at the Wayback Machine)