Apache Spark is a high-performance engine for processing data stored in a Hadoop cluster. Compared with the MapReduce mechanism provided by Hadoop, Spark delivers up to 100 times higher performance when data is processed in memory and up to 10 times higher when data is kept on disk. The engine can run on the nodes of a Hadoop cluster either through Hadoop YARN or in standalone mode. It supports processing data stored in HDFS, Cassandra, and any Hadoop input format (InputFormat).

Type: software framework, cloud computing
Author: Matei Zaharia
Developer: Apache Software Foundation
Platform: Java virtual machine
Operating systems: Windows, Linux, and macOS
Programming languages: Scala, Java, Python, R, SQL, and Java Database Connectivity (JDBC)
Development status: active
License: Apache License 2.0 (originally released under the BSD License)
Repository: https://github.com/apache/spark, https://gitbox.apache.org/repos/asf/spark.git
Website: spark.apache.org

Spark can be used both in typical data-processing scenarios similar to MapReduce and for more specific workloads such as stream processing, SQL, interactive and analytical queries, machine learning, and graph processing. Data-processing programs can be written in Scala, Java, Python, and R.

After leaving the incubator, Spark became a top-level project of the Apache Software Foundation in February 2014. Companies using Spark include Alibaba, Databricks, IBM, Intel, Yahoo, and Cisco Systems. In October 2014 Apache Spark set a world record in sorting 100 terabytes of data. According to a 2015 O'Reilly survey, 17% of data scientists use Apache Spark.

Description

A Spark application consists of a driver process and many executor processes. The driver is the heart of a Spark application and performs the following functions:
- maintains and processes information about the state of the application;
- responds to requests from the user's program;
- analyzes the work, distributes tasks among the executors, and schedules their execution.
The executors, in turn, run the tasks and report their results and their own state back to the driver.

Because the driver and the executors are ordinary processes, Spark can run in a pseudo-distributed local mode, in which all of them run on a single computer with one executor per CPU core and the cluster is merely emulated. This mode is typically used for development and testing, when distributed storage is not needed and the local file system is used instead.

To be deployed on a cluster, Apache Spark needs a cluster manager, which controls the physical machines and allocates resources, and a distributed storage system. The cluster manager can be Spark's own built-in manager (the Spark standalone cluster manager), YARN, or Apache Mesos. For distributed storage, Spark can connect to a wide variety of systems, including the Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, Amazon S3, and Kudu, or a custom solution can be implemented.

Spark lets you write queries through APIs in the following programming languages: Scala, Python, Java, SQL, and R.
- $SPARK_HOME/bin/spark-shell starts a Scala REPL;
- $SPARK_HOME/bin/pyspark starts a Python REPL.
The API to the driver process is called the Spark session (SparkSession) and is available as the variable spark in the Scala and Python shells.
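Outside the shells, a session has to be created explicitly. The following is a minimal sketch, assuming Spark 2.x or later on the classpath; the application name is arbitrary, and "local[*]" requests the pseudo-distributed local mode described above, with one worker thread per CPU core.

import org.apache.spark.sql.SparkSession

// Hypothetical application name; "local[*]" = local mode, one thread per core.
val spark = SparkSession.builder()
  .appName("spark-intro")
  .master("local[*]")
  .getOrCreate()

println(spark.version)   // quick check that the session is alive
spark.stop()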
DataFrame

The basic data structure Spark works with is the DataFrame. It holds a table made up of rows and columns. The list of columns and their types is called the schema; it can be inspected by calling the method df.printSchema(). Similar data structures exist in the R language and in the pandas library, but a Spark DataFrame differs in that it is distributed across several partitions. A partition is the set of DataFrame rows that reside on one physical machine of the cluster. DataFrames in Spark are immutable, but transformations can be applied to them to produce a new DataFrame. For example, val evenRows = myData.where("number % 2 = 0") builds the DataFrame evenRows from the DataFrame myData. Transformations are lazy: they are not executed immediately but are only added to the computation plan, until the user requests an action. An action may write data to the console, a file, or a database, or collect it into objects of the language the query was written in. An example of an action is evenRows.count(), which counts the rows of the DataFrame. An action launches a job, which carries out all the required transformations and the action itself according to an optimized computation plan. The progress of a job can be monitored in the Spark web UI, available at http://localhost:4040 in local mode, or on the cluster node where the driver runs.

SparkSQL

Besides the methods of the DataFrame object, data can also be processed with Spark SQL. For that, Spark can create a view of the data (a virtual table) from a DataFrame. This is done by calling myDataFrame.createOrReplaceTempView("table_name"). After that, a query can be run in the REPL through the sql method of the session object: spark.sql("SELECT * FROM table_name").
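Tying the examples above together, a minimal sketch might look as follows. It assumes an active session named spark, as in the shells; the DataFrame myData and its number column are the hypothetical names used in the text.

// Hypothetical data: a single "number" column with values 0..9.
val myData = spark.range(10).toDF("number")

// Transformation: only recorded in the plan, nothing runs yet.
val evenRows = myData.where("number % 2 = 0")

// Action: triggers a job that executes the optimized plan.
println(evenRows.count())                      // 5

// The same query through Spark SQL, via a temporary view.
myData.createOrReplaceTempView("my_data")
spark.sql("SELECT number FROM my_data WHERE number % 2 = 0").show()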
RDD

The Apache Spark API is centered on a data structure called the resilient distributed dataset (RDD): a read-only, fault-tolerant multiset of data items distributed over a cluster of machines. It was developed in response to the limitations of the MapReduce programming paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function over all the data, reduce the results of the map, and store the result back on disk. Spark's RDDs act as a working set for distributed programs that offers a deliberately restricted form of distributed shared memory. The availability of RDDs makes it possible to implement both iterative algorithms, which visit the data many times in a loop, and interactive or exploratory analysis, i.e., repeated database-style queries against the data. The latency of such applications can be several orders of magnitude lower than that of MapReduce implementations, which are typical of the Apache Hadoop stack. The class of iterative algorithms includes the training algorithms of machine-learning systems, which were the original motivation for developing Apache Spark.

Spark Core

Spark Core is the foundation of the whole project. It provides task dispatching, scheduling, and basic I/O, exposed through an application programming interface for Java, Python, Scala, and R that is centered on the RDD abstraction. The Java API is available to other JVM languages, and can also be used by some non-JVM languages that can attach to the JVM, such as Julia. The interface mirrors a functional, higher-order model of programming: the driver program invokes parallel operations such as map, filter, or reduce on an RDD by passing a function to Spark, which then schedules the execution of that function in parallel on the cluster. These operations, plus additional ones such as join, take RDDs as input and produce new RDDs. RDDs are immutable and operations on them are lazy; fault tolerance is achieved by remembering the lineage of each RDD, the sequence of operations that produced it, so that it can be recomputed if data are lost. RDDs can hold any type of Python, Java, or Scala objects.

Besides the RDD-oriented functional style of programming, Spark provides two restricted kinds of shared variables: broadcast variables, which reference read-only data that must be available on all nodes, and accumulators, which can be used to program aggregation in an imperative style; a short sketch of both follows the word-count example below.

A typical example of RDD-centric functional programming is the following Scala program, which computes the frequency of every word occurring in a set of text files and prints the most common ones. Each map, flatMap (a variant of map) and reduceByKey takes an anonymous function that performs a simple operation on a single data item (or on a pair of items) and applies it to transform an RDD into a new RDD.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("wiki_test")            // create a Spark configuration object
val sc = new SparkContext(conf)                               // create a Spark context
val data = sc.textFile("/path/to/some/directory")             // read the files from some directory into an RDD
val tokens = data.flatMap(_.split(" "))                       // split each file into a list of tokens (words)
val wordFreq = tokens.map((_, 1)).reduceByKey(_ + _)          // assign a count of 1 to each token, then sum the counts for each distinct word
wordFreq.sortBy(s => -s._2).map(x => (x._2, x._1)).top(10)    // get the 10 most frequent words; swap word and count to sort by count
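The shared variables described above can be used as in the following minimal sketch (hypothetical data; it reuses the SparkContext sc from the word-count example): a broadcast variable ships a small read-only set of stop words to every executor, and an accumulator counts how many words were filtered out.

val stopWords = sc.broadcast(Set("the", "a", "an"))     // read-only data, shipped once to every node
val skipped = sc.longAccumulator("skipped words")       // imperative-style counter, aggregated on the driver

val words = sc.parallelize(Seq("the", "quick", "brown", "fox", "a"))
val kept = words.filter { w =>
  // Note: accumulator updates inside transformations may be applied more than
  // once if a task is re-executed; only updates inside actions are exactly-once.
  val keep = !stopWords.value.contains(w)
  if (!keep) skipped.add(1)
  keep
}
println(kept.count())     // action: 3 words kept
println(skipped.value)    // 2 words were skipped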
Spark SQL

The Spark SQL component sits on top of Spark Core and provides a data abstraction called DataFrames (named SchemaRDDs before the release of Spark 1.3) that supports structured and semi-structured data. Spark SQL provides a domain-specific language (DSL) for manipulating DataFrames in Scala, Java, or Python. It also adds SQL language support, with a command-line interface and ODBC/JDBC servers. Although DataFrames lack the compile-time type checking available for RDDs, since Spark 2.0 Spark SQL also supports the strongly typed Dataset.

import org.apache.spark.sql.SQLContext

val url = "jdbc:mysql://yourIP:yourPort/test?user=yourUsername;password=yourPassword"   // URL of the database server
val sqlContext = new org.apache.spark.sql.SQLContext(sc)                                // create an SQL context object

val df = sqlContext
  .read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "people")
  .load()

df.printSchema()                             // print the schema of this DataFrame
val countsByAge = df.groupBy("age").count()  // count people per age

Spark Streaming

Spark Streaming uses Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches. This design lets the same code written for batch analytics be used for streaming analytics, making it easy to implement a lambda architecture. The convenience, however, comes at the cost of a latency equal to the mini-batch duration. Other stream-processing engines exist that process each event individually rather than in mini-batches. Spark Streaming has built-in support for consuming data from Apache Kafka, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets.
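A minimal sketch of the mini-batch model follows (hypothetical host and port; it reuses the SparkContext sc from the earlier examples and assumes a text stream on a TCP socket). The word-count logic is the same kind of transformation chain as in the batch example, applied to each 5-second mini-batch.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))        // mini-batch interval = 5 seconds
val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical TCP text source

val counts = lines.flatMap(_.split(" "))
                  .map(w => (w, 1))
                  .reduceByKey(_ + _)
counts.print()                                        // print a sample of each mini-batch's result

ssc.start()                                           // start receiving and processing
ssc.awaitTermination()                                // run until the job is stopped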
MLlib machine-learning library

Spark MLlib is a distributed machine-learning framework built on top of Spark Core. Largely thanks to Spark's distributed, memory-centric architecture, it is as much as nine times as fast as the disk-based implementation used by Apache Mahout (according to benchmarks run by the MLlib developers on the alternating least squares (ALS) implementations, before Mahout itself gained a Spark interface), and it scales better than Vowpal Wabbit. Many common statistical and machine-learning algorithms are implemented and ship with MLlib, which simplifies building large machine-learning pipelines, including:
- summary statistics, correlations, stratified sampling, hypothesis testing, random data generation;
- classification and regression: support-vector machines, logistic regression, linear regression, decision trees, naive Bayes classification;
- collaborative filtering techniques, including alternating least squares (ALS);
- cluster analysis methods such as k-means and latent Dirichlet allocation (LDA);
- dimensionality reduction techniques such as singular value decomposition (SVD) and principal component analysis (PCA);
- feature extraction and data transformation functions;
- optimization algorithms such as stochastic gradient descent and L-BFGS.

GraphX

GraphX is a distributed graph-processing framework on top of Apache Spark. Because it is built on immutable RDDs, graphs are immutable as well, so GraphX is unsuitable for graphs that need to be updated, let alone updated transactionally as in a graph database. GraphX provides two separate APIs for implementing massively parallel algorithms such as PageRank: a Pregel-like abstraction and a more general MapReduce-style API. Unlike its predecessor Bagel, whose support formally ended with version 1.6, GraphX fully supports property graphs, i.e., graphs in which attributes can be attached to vertices and edges. GraphX can be viewed as the Spark alternative to Apache Giraph, which relies on disk-based Hadoop MapReduce. Like Apache Spark itself, GraphX began as a research project at Berkeley's AMPLab and Databricks, and was later donated to the Apache Software Foundation and the Spark project.
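A minimal sketch of a property graph and PageRank follows (hypothetical vertices and edges; it reuses the SparkContext sc from the earlier examples). Vertex attributes are names, edge attributes are relation labels, and pageRank(0.001) iterates until the ranks converge to the given tolerance.

import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical property graph: vertices carry a name, edges carry a relation label.
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(3L, 1L, "follows")
))
val graph = Graph(vertices, edges)

// Run PageRank until convergence (tolerance 0.001) and print each vertex's rank.
val ranks = graph.pageRank(0.001).vertices
ranks.join(vertices).collect().foreach { case (_, (rank, name)) =>
  println(s"$name: $rank")
}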
Infrastructure

As of September 2017, the Spark Packages site lists more than 360 packages that extend Spark's functionality: they let Spark read and write data in various sources and formats, provide implementations of various machine-learning and graph algorithms, and more.

History

The Spark project was started by Matei Zaharia at UC Berkeley's AMPLab in 2009, and its code was open-sourced in 2010 under a BSD license. In 2013 the project was donated to the Apache Software Foundation and switched its license to Apache 2.0. In February 2014 Spark became a top-level project of the Apache Software Foundation. In November 2014 Databricks, the company founded by Zaharia, used Spark to set a new world record in large-scale data sorting. In 2015 Spark had more than 1000 contributors, making it one of the most active projects in the Apache Software Foundation and one of the most active open-source big-data projects. Given the platform's popularity, both paid providers and free communities began offering specialized training courses.

Version | First release | Latest minor version | Minor release date | Status
0.5 | 2012-06-12 | 0.5.1 | 2012-10-07 | no longer supported
0.6 | 2012-10-14 | 0.6.2 | 2013-02-07 | no longer supported
0.7 | 2013-02-27 | 0.7.3 | 2013-07-16 | no longer supported
0.8 | 2013-09-25 | 0.8.1 | 2013-12-19 | no longer supported
0.9 | 2014-02-02 | 0.9.2 | 2014-07-23 | no longer supported
1.0 | 2014-05-26 | 1.0.2 | 2014-08-05 | no longer supported
1.1 | 2014-09-11 | 1.1.1 | 2014-11-26 | no longer supported
1.2 | 2014-12-18 | 1.2.2 | 2015-04-17 | no longer supported
1.3 | 2015-03-13 | 1.3.1 | 2015-04-17 | no longer supported
1.4 | 2015-06-11 | 1.4.1 | 2015-07-15 | no longer supported
1.5 | 2015-09-09 | 1.5.2 | 2015-11-09 | no longer supported
1.6 | 2016-01-04 | 1.6.3 | 2016-11-07 | no longer supported
2.0 | 2016-07-26 | 2.0.2 | 2016-11-14 | no longer supported
2.1 | 2016-12-28 | 2.1.3 | 2018-06-26 | no longer supported
2.2 | 2017-07-11 | 2.2.3 | 2019-01-11 | no longer supported
2.3 | 2018-02-28 | 2.3.4 | 2019-09-09 | no longer supported
2.4 LTS | 2018-11-02 | 2.4.8 | 2021-05-17 | older version, still supported
3.0 | 2020-06-18 | 3.0.3 | 2021-06-01 | current stable version
3.1 | 2021-03-02 | 3.1.3 | 2022-02-18 | current stable version
3.2 | 2021-10-13 | 3.2.1 | 2022-01-26 | current stable version
3.3 | 2022-06-16 | 3.3.0 | 2022-06-16 | current stable version

Footnotes

- The apache-spark Open Source Project on Open Hub: Languages Page.
- https://projects.apache.org/json/projects/spark.json
- "Spark получил статус первичного проекта Apache". opennet.ru, 27 February 2014. Archived 6 March 2014 at the Wayback Machine.
- Zaharia and Chambers (2017).
- apache.org. Apache Foundation, 18 December 2014. Archived from the original on 19 January 2015; retrieved 18 January 2015.
- Doan, DuyHai (10 September 2014). Cassandra User mailing list. Archived from the original on 30 May 2015; retrieved 21 November 2014.
- Zaharia, Matei; Chowdhury, Mosharaf; Franklin, Michael J.; Shenker, Scott; Stoica, Ion. USENIX Workshop on Hot Topics in Cloud Computing (HotCloud) (PDF). Archived from the original on 10 April 2016; retrieved 15 September 2017.
- Zaharia, Matei; Chowdhury, Mosharaf; Das, Tathagata; Dave, Ankur; Ma, Justin; McCauley, Murphy; Franklin, Michael J.; Shenker, Scott; Stoica, Ion. USENIX Symposium on Networked Systems Design and Implementation (NSDI) (PDF). Archived from the original on 12 August 2017; retrieved 15 September 2017.
- Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael; Shenker, Scott; Stoica, Ion (June 2013) (PDF). Archived from the original on 9 August 2017; retrieved 15 September 2017.
- Harris, Derrick (28 June 2014). Archived from the original on 24 October 2017; retrieved 15 September 2017.
- www.pluralsight.com. Archived from the original on 9 October 2017; retrieved 20 November 2016.
- Shapira, Gwen (29 August 2014). cloudera.com, Cloudera. Archived from the original on 14 June 2016; retrieved 17 June 2016. "re-use the same aggregates we wrote for our batch application on a real-time data stream".
- IEEE (May 2016) (PDF). Archived from the original on 5 March 2020; retrieved 9 October 2017.
- Kharbanda, Arush (17 March 2015). sigmoid.com. Sigmoid (Sunnyvale, California IT product company). Archived from the original on 15 August 2016; retrieved 7 July 2016.
- Sparks, Evan; Talwalkar, Ameet (6 August 2013). slideshare.net. Spark User Meetup, San Francisco, California. Archived from the original on 24 June 2015; retrieved 10 February 2014.
- spark.apache.org. Archived from the original on 19 October 2017; retrieved 18 January 2016.
- Malak, Michael (14 June 2016). slideshare.net, sparksummit.org. Archived from the original on 17 August 2016; retrieved 11 July 2016.
- Malak, Michael (1 July 2016). Manning, p. 89. ISBN 9781617292521. "Pregel and its little sibling aggregateMessages() are the cornerstones of graph processing in GraphX ... algorithms that require more flexibility for the terminating condition have to be implemented using aggregateMessages()".
- Malak, Michael (1 July 2016). Manning, p. 9. ISBN 9781617292521. "Giraph is limited to slow Hadoop Map/Reduce".
- Gonzalez, Joseph; Xin, Reynold; Dave, Ankur; Crankshaw, Daniel; Franklin, Michael; Stoica, Ion (October 2014) (PDF). Archived from the original on 7 December 2014; retrieved 9 October 2017.
- apache.org. Apache Software Foundation, 27 February 2014. Archived from the original on 17 March 2015; retrieved 4 March 2014.
- Venture Beat. Archived from the original on 15 February 2016; retrieved 21 February 2016.
- Spark News. apache.org. Archived from the original on 25 August 2021.
- "Spark 3.1.3 released". Spark News, spark.apache.org. Archived from the original on 18 June 2022.

Literature

- Zaharia, Matei; Chambers, Bill (2017). Spark: The Definitive Guide. O'Reilly Media. ISBN 978-1-4919-1221-8. Archived from the original on 11 September 2017; retrieved 11 September 2017.

External links

- Official website: spark.apache.org
- Virtual Spark cluster in Docker containers
- Damji, Jules; Wenig, Brooke; Das, Tathagata; Lee, Denny (2020). Learning Spark (PDF). O'Reilly Media. ISBN 978-1-492-05004-9.