Conclusions
• Searching for sparse and dense realizations of CRNs
• Large-scale MILP problems handled efficiently
• LP-based methods are presented for both problems
• All the proposed methods can be parallelized
• Efficient computation of CRN realizations with specified structural properties is possible

Abstract
Several kilo-processor and multiprocessor architectures are on the market nowadays; the most important ones are NVIDIA's CUDA architecture, Xilinx FPGAs, and AMD graphics processors. We have presented our pipelined architecture, which can be an alternative to these systems, offering effective solutions to their known limitations. The RACER kilo-processor architecture and programming procedure was designed with the following advantages: there is no global wiring, the number of computational data operations can be changed dynamically according to the task at hand, the data bus is conceptually wide, and a smart memory solves the problem of coalescing. We predict that its practical computing power per unit area is much greater than that of a conventional GPU system.

RACER: Joint instruction-data-stream based kilo-processor architecture and programming procedure
Results
• These solution methods enable us to handle large-scale networks (1082 complexes, 1654 reactions)
Ádám Rák and György Cserey
Faculty of Information Technology, Pázmány Péter Catholic University
Práter utca 50/a, Budapest, H-1083 Hungary, email: [rakad,cserey]@itk.ppke.hu

Processing Element Array
• Based on simple data-flow processing
• Minimal control + arithmetic unit => Processing Element
Smart Memory
• Sorting the data is the job of the memory
• Very high on-chip bandwidth enables fast sorting
• Programmable sorting -> complex operations are possible
• Turing-complete memory: high bandwidth + low arithmetic capability
• Every memory-to-processor transfer is an ordered transfer -> maximal bandwidth
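The memory-side sorting idea can be sketched in a few lines (a toy model; the `SmartMemory` interface is invented for illustration, not the actual hardware API):

```python
# Sketch of the "smart memory" idea: the memory, not the processor, orders
# the data, so every memory<->processor transfer becomes a streaming,
# sequential (maximal-bandwidth) read-out.
import heapq

class SmartMemory:
    def __init__(self, key=None):
        self._heap = []
        self._key = key or (lambda x: x)   # programmable sort key -> complex ops

    def write(self, value):
        """Unordered writes arriving from the processing array."""
        heapq.heappush(self._heap, (self._key(value), value))

    def stream(self):
        """Ordered, sequential read-out toward the processor array."""
        while self._heap:
            yield heapq.heappop(self._heap)[1]

mem = SmartMemory()
for v in [5, 1, 4, 2, 3]:
    mem.write(v)
print(list(mem.stream()))  # [1, 2, 3, 4, 5]
```

Because the read-out is always ordered, the processor array never issues scattered accesses, which is the point of the "maximal bandwidth" bullet above.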
RACER computing architecture
• Both the memory and the processing architecture were redesigned for maximum performance
• Eliminates the classical cache
• Eliminates huge amounts of memory on the processor array
• Supports out-of-phase clocking of the cores -> wave clocking
• Better split of the work between memory and computing array
• Very simple core architecture -> PE (Processing Element)
Processing Element
• I/O multiplexers to handle data-flow control (explained later)
• Arithmetic unit to process the data
• Internal feedback to facilitate multiply-accumulate (MAC)
• Local read-only memory for storing arithmetic constants
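The MAC path through a PE might look like this (a minimal behavioral sketch, not the actual hardware; the class and method names are hypothetical):

```python
# Behavioral sketch of a RACER-style Processing Element: an arithmetic unit,
# an internal feedback register that enables multiply-accumulate, and a
# small local read-only store for constants.

class ProcessingElement:
    def __init__(self, rom=()):
        self.rom = tuple(rom)   # local read-only memory for constants
        self.acc = 0            # internal feedback register (MAC state)

    def mac(self, a, b):
        """Multiply-accumulate: acc += a * b; the result is fed back internally."""
        self.acc += a * b
        return self.acc

# Dot product of a data stream with ROM-stored coefficients:
pe = ProcessingElement(rom=(2, 3, 4))
for x, c in zip([1, 1, 1], pe.rom):
    out = pe.mac(x, c)
print(out)  # 1*2 + 1*3 + 1*4 = 9
```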
Control-flow: loop
• Implements typical pipe parallelism -> everything is parallel
• Loops are cyclic pipelines
• The entry/exit point handles filling/emptying
• Usually causes congestion
• Workload can be distributed between loops -> multi-thread parallelism
• Near-100% arithmetic efficiency
• Optimal memory bandwidth usage
Arithmetic Unit example
• Simple scalar arithmetic units
• 2-cycle pipelined operations
• Many-cycle operations are also possible
Data-flow congestion
• Dynamic pipeline model
• Empty spaces disappear in case of congestion
• Stalls propagate backwards through the pipeline
• Loops can be implemented
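The bubble-squeezing behavior can be illustrated with a toy pipeline model (purely illustrative; not how the hardware is described in the slides):

```python
# Toy dynamic-pipeline model: each stage holds at most one token; a token
# advances only if the next stage is free. When the exit stalls
# (congestion), the stall propagates backwards and the empty slots
# ("bubbles") between tokens are absorbed.

def step(pipe, exit_stalled):
    for i in reversed(range(len(pipe))):
        if pipe[i] is None:
            continue
        if i == len(pipe) - 1:
            if not exit_stalled:
                pipe[i] = None          # token leaves the pipeline
        elif pipe[i + 1] is None:
            pipe[i + 1] = pipe[i]       # advance into the free slot
            pipe[i] = None

pipe = ['a', None, 'b', None]           # tokens separated by bubbles
for _ in range(3):                      # exit stalled: bubbles get squeezed out
    step(pipe, exit_stalled=True)
print(pipe)                             # [None, None, 'a', 'b']
```

After a few stalled cycles the tokens are packed back-to-back at the exit: the empty spaces have disappeared, exactly the congestion behavior listed above.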
Processing Element Array
• The data-flow enters on a port, coming from a periphery or memory
• The processed data leaves the same way
• Operations on the data are stationary; the data is moving
• Program code can move on the same channel
• Programming the array on the fly is just as fast as a data transfer
• Multiple, crossing data paths are possible -> multitasking
Data-flow propagation
• The data flows in deeply branching pipelines
• Data propagates at only half rate
• Very robust against spatially out-of-phase clocks
• Congestion is possible for many-cycle operations
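The half-rate figure follows from purely local handshakes: a stage only learns one cycle later that its successor has freed up, so a continuous stream settles into a token/bubble/token/bubble pattern. A toy snapshot-update model (illustrative only, names invented):

```python
# Each cycle is computed from a snapshot of the previous state: a token
# advances only if its successor slot *was* empty last cycle. Injecting a
# token every cycle therefore fills only every other slot -> half rate.

def step(old, token=None):
    n = len(old)
    new = [None] * n
    for i, tok in enumerate(old):
        if tok is None:
            continue
        if i == n - 1:
            continue                    # token leaves at the exit
        if old[i + 1] is None:
            new[i + 1] = tok            # successor was free -> advance
        else:
            new[i] = tok                # congestion: stay put
    if token is not None and old[0] is None and new[0] is None:
        new[0] = token                  # inject when the entry slot is free
    return new

pipe = [None] * 6
for t in range(8):                      # try to inject a token every cycle
    pipe = step(pipe, token=f"t{t}")
print(pipe)                             # every other slot holds a token
```

Because each transfer depends only on the neighboring stage's previous state, nothing breaks if adjacent stages are clocked slightly out of phase, which matches the robustness bullet above.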
Conclusions
• Recent advancements make the physical realization possible
• Very good fit for a wide range of algorithms (3D rendering, ray tracing, databases, scientific computing, video compression)
• Solves the memory bandwidth utilization problem
• Much harder to program than traditional CPU architectures
• Needs significant advancement in compiler technology
Control-flow: PHI
• The opposite of a branch
• Subtypes:
  – regular data-flow merge by priority
  – merge by a condition (control-flow)
  – balanced merge
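The three merge subtypes can be sketched as stream combinators (the helper names are hypothetical, chosen only to mirror the bullet list):

```python
# Three PHI (merge) policies over two input streams.

def merge_priority(a, b):
    """Regular data-flow merge: drain the high-priority stream first."""
    return list(a) + list(b)

def merge_by_condition(a, b, cond):
    """Control-flow merge: a condition picks the source per output element."""
    a, b = iter(a), iter(b)
    return [next(a) if c else next(b) for c in cond]

def merge_balanced(a, b):
    """Balanced merge: alternate sources to even out the load."""
    out = []
    for x, y in zip(a, b):
        out += [x, y]
    return out

print(merge_priority([1, 2], [9, 9]))                                  # [1, 2, 9, 9]
print(merge_by_condition([1, 2], [9, 9], [True, False, True, False]))  # [1, 9, 2, 9]
print(merge_balanced([1, 2], [9, 9]))                                  # [1, 9, 2, 9]
```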
Control-flow: branch
• The data-flow can branch:
  – according to a condition
  – according to congestion -> to distribute workload
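Both branch types can be sketched the same way (illustrative helpers, not part of the RACER toolchain):

```python
# Two ways a data-flow can branch: by a predicate on each token, or by
# congestion, sending each token to the least-loaded output path.

def branch_by_condition(stream, pred):
    taken, not_taken = [], []
    for x in stream:
        (taken if pred(x) else not_taken).append(x)
    return taken, not_taken

def branch_by_congestion(stream, outputs):
    for x in stream:
        min(outputs, key=len).append(x)   # route away from congestion
    return outputs

print(branch_by_condition(range(6), lambda x: x % 2 == 0))  # ([0, 2, 4], [1, 3, 5])
print(branch_by_congestion(range(6), [[], []]))             # [[0, 2, 4], [1, 3, 5]]
```

The congestion-based variant is what distributes workload between parallel loops, as described on the loop slide.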
Yet…
• High energy physics detectors need:
• Processing of large amounts of information
• Very short time response
• Complex structures, many sensors with topographic information
• Efficient processing of data
• Huge benefits expected from the use of cellular / multicore techniques
• Where are we?
IRUN 2012, Budapest, Hungary, 2012-11-06

Generic Detector Structure
Preamble
• Cellular Neural Networks (cellular techniques) assets
• Parallel hardware implementation
• Universality
• Local sensor information processing
• Massively parallel systems
• MultiCore processors
• Processor as one complexity step above the cell
• Massively parallel systems
• Connectivity
LHCb Trigger: Two levels
Level 0
• reduce rate from 40 MHz (bunch crossing) to 1 MHz
• fixed latency 4 μs
• hardware trigger
• custom electronics

HLT (High Level Trigger)
• reduce rate from 1 MHz to ~2 kHz
• full detector info available
• software trigger
• CPU farm (~16000 nodes)
• latency dependent on number of CPUs

Multicore trends in high energy physics applications
Xavier Vilasís Cardona (LIFAELS-URL), Niko Neufeld (CERN)
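A quick sanity check on the quoted rates (using only the numbers on this slide):

```python
# Rejection factors of the two LHCb trigger levels and the average event
# rate each HLT farm node has to absorb.

l0_in, l0_out = 40e6, 1e6        # Hz: bunch crossing -> after Level 0
hlt_out, nodes = 2e3, 16000      # Hz after HLT, ~number of farm nodes

print(f"Level-0 rejection: {l0_in / l0_out:.0f}x")    # 40x
print(f"HLT rejection:     {l0_out / hlt_out:.0f}x")  # 500x
print(f"Rate per HLT node: {l0_out / nodes:.1f} Hz")  # 62.5 Hz
```

So each software-trigger node sees only ~60 events per second, which is why the HLT latency can depend on the number of CPUs while Level 0 must be fixed-latency hardware.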
Thanks to Felice Pantaleo (CERN) and David Rohr (FIAS)

Detector structure and functions
• Tracking
• Vertex detectors
• Follow particle trajectories
• Particle identification
• Calorimeters
• Measure the energy
• Trigger
• Data acquisition
• Stripping
• Analysis
• LHCb
• ~2 million channels
• 40 MHz
• 2 Tb/s
Tomography reconstruction at the ESRF
• Several thousands of images per second for 3D reconstruction
• 2000 projections for 3 Gvoxels
• CUDA on an NVIDIA GPU
• Two challenges:
• Computation time
• Data throughput
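The scale of the two challenges can be estimated from figures quoted elsewhere in the deck (~54 TFlop and ~35 GB of data for one 3-gigavoxel reconstruction from 2000 projections, done in ~40 s on a 4-GPU server):

```python
# Back-of-the-envelope sustained rates implied by the reconstruction figures.

flops, data_gb, seconds, voxels = 54e12, 35.0, 40.0, 3e9

print(f"sustained compute: {flops / seconds / 1e12:.2f} TFlop/s")  # 1.35
print(f"sustained I/O:     {data_gb / seconds:.2f} GB/s")          # 0.88
print(f"Flop per voxel:    {flops / voxels:.0f}")                  # 18000
```

Both rates have to be sustained simultaneously, which is why computation time and data throughput are listed as separate challenges.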
Synchrotron X-Ray Radiation

[Slide reproduces pages from Chilingarian et al., IEEE Trans. Nucl. Sci. 58(4), 2011, on GPU-based filtered back-projection for tomographic reconstruction: a GPU server with 4 graphics cards reconstructs a 3-gigavoxel image from 2000 projections in 40 seconds, about 30 times faster than an 8-core Xeon server of the same price; even a $1000 desktop with a single GPU card outperforms the Xeon server by a factor of four. Loading multiple gigabytes of source data remains the bottleneck: with the fastest SSDs in RAID-0, disk operations still take about 2 minutes, so work is ongoing to read images directly into system memory, bypassing the hard drive. Among NVIDIA products, the dual-GT200 GeForce GTX 295 was the fastest adapter tested; a desktop with two GTX 295 cards reaches 1 TFlop/s at a price below $2000, and the Tesla/Fermi professional features did not significantly improve the reconstruction.]
Chilingarian et al., TNS 58(4) 1447, 2011
Current proposals: steps beyond
• Mostly GPU applications
• Reconstruction of tomographies at the European Synchrotron Radiation Facility
• Integration of GPUs into the ITMS framework of data acquisition in fusion experiments
• Track reconstruction simulation for PANDA at FAIR
• Trigger of NA62
• RICH reconstruction for ALICE
Parallel opportunities
• On detector (mostly CA)
• Trackers
• Calorimeters
• Trigger systems
• Event decision: one event per node + dispatcher
• The GRID
• Opportunity awareness
• CERN openlab
DAQ for fusion experiments
• Intelligent Test and Measurement System (ITMS) platform
• Framework of DAQ for fusion experiments
• Self-adapting sampling rate according to input data rate
• Optimise data volume without loss of precision
• GPU performing the adaptive algorithm
• Bandwidth detection
• Based on FFT
• Server box with 2 CPUs and 1 GPU (NVIDIA Tesla)
• Compared to direct LabVIEW acquisition
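The FFT-based bandwidth detection might work along these lines (a hedged sketch using a plain DFT; the actual ITMS algorithm is not shown in the slides, and the energy-fraction criterion below is an assumption):

```python
# Estimate the occupied bandwidth of a sampled signal from its spectrum,
# then derive a reduced sampling rate that still preserves the signal.
import cmath, math

def dft_mag(x):
    """Magnitudes of the first n/2 DFT bins (O(n^2), for illustration only)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) for k in range(n // 2)]

def detect_bandwidth(x, fs, frac=0.99):
    """Smallest frequency below which `frac` of the spectral energy lies."""
    energy = [m * m for m in dft_mag(x)]
    total, acc = sum(energy), 0.0
    for k, e in enumerate(energy):
        acc += e
        if acc >= frac * total:
            return k * fs / len(x)
    return fs / 2

fs = 1024.0                          # current sampling rate, Hz
sig = [math.sin(2 * math.pi * 52 * t / fs) for t in range(256)]
bw = detect_bandwidth(sig, fs)
print(f"bandwidth ~{bw:.0f} Hz -> resample at >= {2.5 * bw:.0f} Hz")
```

With only a 52 Hz component present, the detector reports ~52 Hz and the DAQ could drop its sampling rate by an order of magnitude without losing precision, which is the data-volume optimisation the slide describes.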
PANDA detector simulation
Al-Turany et al., JoP Conference Series 219 (2010) 042001

Tomography reconstruction at the ESRF
[Slide reproduces figures and text from the same IEEE TNS paper: Fig. 3 — the 2000 projections used to reconstruct a 3-gigavoxel image of porous polyethylene grains; the reconstruction costs about 54 TFlop and moves 35 GB of data. Fig. 4 — benchmarks of C compilers (gcc, clang, Intel CC with SSE vectorization) and FFT libraries. Fig. 5 — reconstruction times on five hardware platforms: with a GPU, even a cheap desktop is about 4 times faster than an expensive Xeon server, and the 4-GPU server is 30 times faster at a comparable price. Fig. 6 — scalability on an NVIDIA Tesla S1070: scaling from 1 to 4 Tesla C1060 GPUs loses only 2.6% of the maximum possible performance. Fig. 7 — MFlop/s-per-dollar efficiency of the back-projection step: desktop products are superior to the server-based solutions by this metric. The paper also compares NVIDIA generations: the dual-GT200 GeForce GTX 295, its single-processor counterpart GTX 280, the professional Tesla C1060 (more memory, reduced clock rates), and the Fermi-based GTX 480 flagship.]
Scalability of the GPU implementation

[Chart residue: platform prices $5500, $8000, $1000, $12000, $1800]
GPU server is 30 times faster than the Xeon server

PANDA detector simulation
• FAIR facility
• PANDA detector simulation: tracker
• Testing several configurations
Tomography reconstruction at the ESRF
• 5 computer configurations tested
• Xeon server
• Simple desktop with 1 GPU
• Advanced desktop with 2 GPUs
• NVIDIA Tesla server with 4 GPUs
• Supermicro GPU server with 4 GPUs
DAQ for fusion experiments
Nieto et al., TNS 58(4) 1714, 2011