
RACER: Joint instruction-data-stream based kilo-processor architecture and programming procedure

Conclusions

• Searching for sparse and dense realizations of CRNs
• Large-scale MILP problems handled efficiently
• LP-based methods are presented for both problems
• All the proposed methods can be parallelized
• Efficient computing of CRN realizations with specified structural properties is possible

Abstract

Several kilo-processor and multiprocessor architectures are available on the market today; the most important ones are the CUDA architecture of NVIDIA, the FPGAs of Xilinx, and the graphics processors of AMD. We present our pipelined architecture, which can serve as an alternative to these systems and offers effective solutions to their known limitations. The RACER kilo-processor architecture and programming procedure was designed with the following advantages: there is no global wiring, the number of computational data operations can be changed dynamically according to the task to be performed, the data bus is conceptually wide, and a smart memory is used to solve the coalescing problem. Our prediction is that its practical computing power per chip area is much greater than that of a conventional GPU system.


Results

• These solution methods enable us to handle large-scale networks (1082 complexes, 1654 reactions)


Ádám Rák and György Cserey
Faculty of Information Technology, Pázmány Péter Catholic University
Práter utca 50/a, Budapest, H-1083 Hungary, email: [rakad,cserey]@itk.ppke.hu

Processing Element Array
• Based on simple data-flow processing
• Minimal control + arithmetic unit => Processing Element


Smart Memory
• Sorting the data is the job of the memory
• Very high on-chip bandwidth enables fast sorting
• Programmable sorting -> complex operations are possible (see the sketch below)
• Turing-complete memory: high bandwidth + low arithmetic capability
• Transfers between memory and processor are always ordered -> maximal bandwidth
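As a rough illustration of the smart-memory idea above (the memory, not the processor, does the ordering, so every transfer towards the PE array is an ordered, coalesced burst), here is a minimal C++ sketch. The Record layout, the key field and the SmartMemory class are illustrative assumptions, not part of the RACER design.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical record: a key the smart memory sorts on, plus a payload
// destined for the processing-element array.
struct Record {
    std::uint32_t key;
    float payload;
};

// Model of a "smart memory": low arithmetic capability, but it can reorder
// its contents so every transfer to the PE array is one ordered burst.
class SmartMemory {
public:
    void store(Record r) { data_.push_back(r); }

    // The memory itself performs the sort; the processor array never sees
    // scattered accesses, only a single maximal-bandwidth ordered stream.
    std::vector<Record> stream_ordered() {
        std::sort(data_.begin(), data_.end(),
                  [](const Record& a, const Record& b) { return a.key < b.key; });
        return data_;
    }

private:
    std::vector<Record> data_;
};

int main() {
    SmartMemory mem;
    mem.store({42, 1.0f});
    mem.store({7, 2.0f});
    mem.store({19, 3.0f});
    for (const Record& r : mem.stream_ordered())
        std::cout << r.key << " -> " << r.payload << '\n';
}
```

In hardware this reordering would exploit the very high on-chip bandwidth noted above; the sketch only shows the division of labour between memory and processing array.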


RACER computing architecture
• Redesigned both the memory and the processing architecture to achieve maximum performance
• Eliminates the classical cache
• Eliminates huge amounts of memory on the processor array
• Supports out-of-phase clocking of the cores -> wave clocking
• Better split of the jobs between memory and computing array
• Very simple core architecture -> PE (Processing Element)


Processing Element
• IO multiplexers to handle data-flow control (explained later)
• Arithmetic unit to process the data
• Internal feedback to facilitate multiply-accumulate (MAC)
• Local read-only memory for storing arithmetic constants (see the model sketched below)
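A minimal software model of such a PE, assuming an illustrative four-operation instruction set and a four-entry constant ROM (the real RACER opcodes and word widths are not specified here). The two operand arguments stand in for the IO multiplexers, and the accumulator register provides the internal MAC feedback path.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <iostream>

// Illustrative opcodes; the real RACER instruction set is not specified here.
enum class Op : std::uint8_t { Add, Mul, Mac, PassA };

class ProcessingElement {
public:
    explicit ProcessingElement(std::array<float, 4> constants) : rom_(constants) {}

    // in_a/in_b stand for the IO multiplexers selecting a neighbouring
    // data-flow channel; sel_const swaps in_b for a ROM constant.
    float step(Op op, float in_a, float in_b, int sel_const = -1) {
        float b = (sel_const >= 0) ? rom_[static_cast<std::size_t>(sel_const)] : in_b;
        switch (op) {
            case Op::Add:   acc_ = in_a + b;        break;
            case Op::Mul:   acc_ = in_a * b;        break;
            case Op::Mac:   acc_ = acc_ + in_a * b; break;  // internal feedback path
            case Op::PassA: acc_ = in_a;            break;
        }
        return acc_;  // result is forwarded to the next PE in the pipeline
    }

private:
    float acc_ = 0.0f;          // feedback register enabling multiply-accumulate
    std::array<float, 4> rom_;  // local read-only constants
};

int main() {
    ProcessingElement pe({0.5f, 2.0f, 0.0f, 0.0f});
    pe.step(Op::PassA, 0.0f, 0.0f);                        // clear accumulator
    pe.step(Op::Mac, 3.0f, 0.0f, /*const*/ 1);             // acc += 3 * 2
    std::cout << pe.step(Op::Mac, 4.0f, 0.0f, 1) << '\n';  // acc += 4 * 2 -> 14
}
```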


Control-flow: loop
• Implements typical pipe parallelism -> everything is parallel
• Loops are cyclic pipelines (see the sketch below)
• The entry/exit point handles filling/emptying
• Usually causes congestion
• Workload can be distributed between loops -> multi-thread parallelism
• Near 100% arithmetic efficiency
• Optimal memory bandwidth usage
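To make the "loop = cyclic pipeline" picture concrete, the toy C++ simulation below circulates tokens around a small ring of stages. Stage 0 acts as the entry/exit point that fills and drains the loop, and the last stage stands in for the loop body; the four-stage ring, the iterations_left exit condition and the doubling body are all invented for the demo.

```cpp
#include <deque>
#include <iostream>
#include <optional>
#include <vector>

// Token circulating in the cyclic pipeline: a value plus how many more
// passes through the loop body it still needs (the exit condition).
struct Token { int value; int iterations_left; };

int main() {
    const int kStages = 4;                       // ring of 4 pipeline stages
    std::vector<std::optional<Token>> ring(kStages);
    std::deque<Token> input = {{1, 3}, {2, 3}, {3, 3}, {4, 3}};
    std::vector<int> output;

    for (int cycle = 0; cycle < 40 && (!input.empty() || output.size() < 4); ++cycle) {
        // Stage 0 doubles as entry/exit point: finished tokens leave,
        // new tokens are injected into empty slots (filling/emptying the loop).
        if (ring[0] && ring[0]->iterations_left == 0) {
            output.push_back(ring[0]->value);
            ring[0].reset();
        }
        if (!ring[0] && !input.empty()) {
            ring[0] = input.front();
            input.pop_front();
        }
        // Rotate: every stage passes its slot content to the next stage.
        std::optional<Token> carry = ring[kStages - 1];
        for (int s = kStages - 1; s > 0; --s) ring[s] = ring[s - 1];
        ring[0] = carry;
        // The last stage models the loop body: one unit of work per pass.
        if (ring[kStages - 1]) {
            ring[kStages - 1]->value *= 2;
            ring[kStages - 1]->iterations_left -= 1;
        }
    }
    for (int v : output) std::cout << v << ' ';  // each input multiplied by 2^3
    std::cout << '\n';
}
```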


Arithmetic Unit example
• Simple scalar arithmetic units
• 2-cycle pipelined operations (see the sketch below)
• Many-cycle operations are also possible
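A minimal sketch of a 2-cycle pipelined scalar operation: a new operand pair can be accepted every cycle and each result emerges two cycles later. The split into two register stages and the choice of multiplication are illustrative assumptions.

```cpp
#include <cstdio>
#include <optional>
#include <utility>

// Two-stage pipelined multiplier: throughput of one operation per cycle,
// latency of two cycles. The stage registers hold in-flight work.
class PipelinedMul {
public:
    // Push at most one operand pair per cycle; returns the result that
    // completes in this cycle, if any.
    std::optional<float> cycle(std::optional<std::pair<float, float>> in) {
        std::optional<float> out = stage2_;   // stage 2: a result leaves the pipe
        stage2_.reset();
        if (stage1_) {                        // stage 2 consumes stage 1's operands
            stage2_ = stage1_->first * stage1_->second;
        }
        stage1_ = in;                         // stage 1 latches the new operands
        return out;
    }

private:
    std::optional<std::pair<float, float>> stage1_;
    std::optional<float> stage2_;
};

int main() {
    PipelinedMul mul;
    std::pair<float, float> inputs[] = {{2, 3}, {4, 5}, {6, 7}};
    for (int c = 0; c < 5; ++c) {
        std::optional<std::pair<float, float>> in;
        if (c < 3) in = inputs[c];
        if (auto r = mul.cycle(in)) std::printf("cycle %d: result %.1f\n", c, *r);
    }
}
```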


Data-flow congestion
• Dynamic pipeline model
• Empty spaces (bubbles) disappear in case of congestion
• This happens because congestion propagates backwards through the pipeline (see the sketch below)
• Loops can be implemented
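The congestion behaviour can be sketched as an elastic (ready/valid style) pipeline in C++: a token advances only into a free slot, so when the sink stalls, the bubbles between tokens vanish as the stall propagates backwards. Stage count, input pattern and stall window are invented for the demo.

```cpp
#include <cstddef>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

// Elastic pipeline: each stage holds at most one token and passes it forward
// only when the slot in front is free. Stalling the sink for a few cycles
// shows the bubbles between tokens being absorbed as congestion propagates
// backwards through the pipe.
int main() {
    const int kStages = 5;
    std::vector<std::optional<int>> stage(kStages);
    // Sparse input stream: tokens separated by bubbles (empty slots).
    std::vector<std::optional<int>> input = {1, {}, 2, {}, 3, {}, 4};

    for (std::size_t cycle = 0; cycle < input.size() + 2 * kStages; ++cycle) {
        bool sink_stalled = (cycle >= 3 && cycle <= 6);     // downstream congestion
        if (!sink_stalled && stage[kStages - 1]) {          // sink accepts a token
            stage[kStages - 1].reset();
        }
        for (int s = kStages - 1; s > 0; --s) {             // advance where possible
            if (!stage[s] && stage[s - 1]) { stage[s] = stage[s - 1]; stage[s - 1].reset(); }
        }
        if (cycle < input.size() && !stage[0]) stage[0] = input[cycle];

        std::cout << "cycle " << cycle << ": ";
        for (const auto& t : stage) std::cout << (t ? std::to_string(*t) : ".") << ' ';
        std::cout << (sink_stalled ? "(sink stalled)\n" : "\n");
    }
}
```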


Processing Element Array
• The data-flow enters on a port, coming from a periphery or from memory
• The processed data leaves the same way
• Operations on the data are stationary, the data is moving
• Program code can move on the same channel (see the sketch below)
• Programming the array on-the-fly is just as fast as a data transfer
• Multiple, crossing data paths are possible -> multitasking
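One way to picture program code travelling on the same channel as the data is a tagged word stream: configuration words retarget a PE on the fly, while data words flow through whatever operation is currently loaded. The tag encoding and the two-entry "instruction set" below are assumptions for illustration only.

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

// Words travelling on the same channel are tagged as either program or data,
// so the array can be reconfigured on-the-fly at data-transfer speed.
struct Word {
    enum class Kind { Program, Data } kind;
    int payload;  // opcode id for Program words, operand for Data words
};

int main() {
    // Illustrative "instruction set" the stream can select between.
    std::vector<std::function<int(int)>> ops = {
        [](int x) { return x + 1; },  // opcode 0
        [](int x) { return x * 2; },  // opcode 1
    };
    std::function<int(int)> current = ops[0];

    // Mixed stream: program words retarget the PE, data words flow through it.
    std::vector<Word> stream = {
        {Word::Kind::Program, 0}, {Word::Kind::Data, 5}, {Word::Kind::Data, 6},
        {Word::Kind::Program, 1}, {Word::Kind::Data, 5}, {Word::Kind::Data, 6},
    };
    for (const Word& w : stream) {
        if (w.kind == Word::Kind::Program)
            current = ops[static_cast<std::size_t>(w.payload)];
        else
            std::cout << current(w.payload) << '\n';  // stationary op, moving data
    }
}
```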


Data-flow propagation
• The data flows in deep, branching pipelines
• Data propagates at only half rate (see the sketch below)
• Very robust against spatially out-of-phase clocks
• Congestion is possible for many-cycle operations
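The half-rate claim can be seen in a toy model where each stage decides purely from its neighbour's state in the previous cycle; having no combinational ready chain is also what makes the array tolerant of spatially out-of-phase clocks. A dense stream then necessarily travels as alternating token/bubble pairs, i.e. at half rate. Stage and cycle counts are arbitrary.

```cpp
#include <iostream>
#include <optional>
#include <string>
#include <vector>

// Each stage looks only at its neighbour's *previous* state (no global ready
// signal), which tolerates out-of-phase clocks but limits a dense stream to
// half-rate propagation: tokens alternate with bubbles.
int main() {
    const int kStages = 6;
    std::vector<std::optional<int>> cur(kStages), next(kStages);
    int produced = 0, delivered = 0;

    for (int cycle = 0; cycle < 24; ++cycle) {
        next = cur;
        for (int s = kStages - 1; s >= 1; --s) {
            if (!cur[s] && cur[s - 1]) {     // slot ahead was empty last cycle
                next[s] = cur[s - 1];
                next[s - 1].reset();
            }
        }
        if (!cur[0]) next[0] = ++produced;                          // dense input stream
        if (cur[kStages - 1]) { ++delivered; next[kStages - 1].reset(); }  // sink
        cur = next;

        std::cout << "cycle " << cycle << ": ";
        for (const auto& t : cur) std::cout << (t ? std::to_string(*t) : ".") << ' ';
        std::cout << '\n';
    }
    std::cout << "delivered " << delivered << " tokens in 24 cycles\n";  // ~half rate
}
```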


Conclusions
• Recent advancements make the physical realization possible
• Very good fit for a wide range of algorithms (3D rendering, ray-tracing, databases, scientific computing, video compression)
• Solves the memory bandwidth utilization problem
• Much harder to program than traditional CPU architectures
• Needs significant advancement in compiler technology


Control-flow: PHI
• The opposite of the branches (see the sketch below)
• Subtypes:
  – regular data-flow merge by priority
  – merge by a condition (control-flow)
  – balanced merge
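The three PHI subtypes can be sketched as policies for choosing which of two incoming data-flow queues feeds the output. Queues of plain ints and the absence of empty-queue handling are simplifications for illustration.

```cpp
#include <deque>
#include <iostream>

// Two data-flow branches arriving at a PHI node, modelled as queues.
// (Toy code: no empty-queue handling.)
std::deque<int> a = {1, 2, 3}, b = {10, 20, 30};

// Merge by priority: drain input A whenever it has data, otherwise take B.
int merge_priority() { auto& q = !a.empty() ? a : b; int v = q.front(); q.pop_front(); return v; }

// Merge by a condition: an explicit control-flow token selects the source.
int merge_by_condition(bool take_a) { auto& q = take_a ? a : b; int v = q.front(); q.pop_front(); return v; }

// Balanced merge: alternate between the inputs to keep both branches flowing.
int merge_balanced() { static bool turn_a = true; turn_a = !turn_a; return merge_by_condition(!turn_a); }

int main() {
    std::cout << merge_priority() << ' '           // 1  (A has data, so A wins)
              << merge_by_condition(false) << ' '  // 10 (condition picks B)
              << merge_balanced() << ' '           // 2  (balanced: A's turn)
              << merge_balanced() << '\n';         // 20 (then B's turn)
}
```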


Control-flow: branch
• The data-flow can branch (see the sketch below):
  – according to a condition
  – according to congestion -> to distribute workload
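The two branch flavours can likewise be sketched as per-token routing policies: a data-dependent condition, or a congestion test that steers each token to the emptier downstream pipe. The queues and the particular condition are invented for the demo.

```cpp
#include <deque>
#include <iostream>

// Downstream pipes of a branch node, modelled as queues.
std::deque<int> left, right;

// Branch according to a data-dependent condition.
void branch_by_condition(int v) { (v % 3 == 0 ? left : right).push_back(v); }

// Branch according to congestion: steer each token to the emptier pipe,
// distributing the workload between parallel loops.
void branch_by_congestion(int v) { (left.size() <= right.size() ? left : right).push_back(v); }

int main() {
    for (int v = 0; v < 6; ++v) branch_by_condition(v);
    std::cout << "by condition:  left=" << left.size() << " right=" << right.size() << '\n';
    left.clear(); right.clear();
    for (int v = 0; v < 6; ++v) branch_by_congestion(v);
    std::cout << "by congestion: left=" << left.size() << " right=" << right.size() << '\n';
}
```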


Yet…
• High energy physics detectors need:
  – processing large amounts of information
  – very short time response
  – complex structures, many sensors with topographic information
  – efficient processing of data
• Huge benefits expected from the use of Cellular / Multicore techniques
• Where are we?


Generic Detector Structure


Preamble
• Cellular Neural Networks (cellular techniques) assets
  – parallel hardware implementation
  – universality
  – local sensor information processing
  – massively parallel systems
• Multicore Processors
  – processor as one complexity step above the cell
  – massively parallel systems
  – connectivity


LHCb Trigger: Two levels

Level 0
• reduce rate from 40 MHz (bunch crossing) to 1 MHz
• fixed latency 4 μs
• hardware trigger
• custom electronics

HLT (High Level Trigger)
• reduce rate from 1 MHz to ~2 kHz
• full detector info available
• software trigger
• CPU farm (~16000 nodes)
• latency dependent on the number of CPUs (see the data-rate estimate below)
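A back-of-the-envelope calculation from the figures quoted above (2 Tb/s raw at a 40 MHz bunch-crossing rate, 1 MHz after Level 0, ~2 kHz after the HLT, ~16000 HLT nodes); treating the event size as constant through the trigger chain is a simplification.

```cpp
#include <cstdio>

int main() {
    // Figures quoted on the slide.
    const double raw_rate_hz = 40e6;   // bunch-crossing rate
    const double raw_bw_bps  = 2e12;   // 2 Tb/s off the detector
    const double l0_rate_hz  = 1e6;    // after Level 0 (hardware trigger)
    const double hlt_rate_hz = 2e3;    // after the High Level Trigger

    // Implied average event size (simplification: constant through the chain).
    const double event_bits = raw_bw_bps / raw_rate_hz;
    std::printf("avg event size:      %.0f kbit (~%.0f kB)\n", event_bits / 1e3, event_bits / 8e3);
    std::printf("bandwidth after L0:  %.1f Gb/s\n", l0_rate_hz * event_bits / 1e9);
    std::printf("bandwidth after HLT: %.2f Gb/s\n", hlt_rate_hz * event_bits / 1e9);
    // Per-node budget for the ~16000-node CPU farm.
    std::printf("L0-accepted events per HLT node: %.1f Hz\n", l0_rate_hz / 16000.0);
}
```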

Multicore trends in high energy physics applications
Xavier Vilasís Cardona (LIFAELS - URL), Niko Neufeld (CERN)

Thanks to Felice Pantaleo (CERN) and David Rohr (FIAS)
IRUN 2012, Budapest, Hungary (2012-11-06)

Detector structure and functions
• Tracking
• Vertex detectors
• Follow particle trajectories
• Particle Identification
• Calorimeters
• Measure the energy
• Trigger
• Data Acquisition
• Stripping
• Analysis

• LHCb
• ~2 million channels
• 40 MHz
• 2 Tb/s


Tomography reconstruction at the ESRF
• Several thousands of images per second for 3D reconstruction
• 2000 projections for 3 Gvoxels
• CUDA on an NVIDIA GPU
• Two challenges (rough numbers sketched below):
  – computation time
  – data throughput
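A rough feel for the two challenges from the numbers above: 2000 projections backprojected into a 3 gigavoxel volume. The flops-per-voxel-update constant and the sustained GPU rate below are assumptions, chosen so the totals are consistent with the ~54 TFlop and 35 GB figures quoted from Chilingarian et al. later in this document.

```cpp
#include <cstdio>

int main() {
    // Slide figures: 2000 projections reconstructed into a 3 gigavoxel volume.
    const double n_proj   = 2000;
    const double n_voxels = 3e9;
    // Assumption: ~9 flops per voxel update in filtered backprojection
    // (chosen so the total matches the ~54 TFlop quoted by Chilingarian et al.).
    const double flops_per_update = 9.0;
    const double total_flop = n_proj * n_voxels * flops_per_update;

    // Assumption: ~1.3 TFlop/s sustained over a 4-GPU server.
    const double sustained_flops = 1.3e12;
    // Raw data volume quoted for the dataset.
    const double data_bytes = 35e9;

    const double compute_s = total_flop / sustained_flops;
    std::printf("total work:        %.0f TFlop\n", total_flop / 1e12);
    std::printf("compute time:      %.0f s at %.1f TFlop/s sustained\n",
                compute_s, sustained_flops / 1e12);
    std::printf("required I/O rate: %.1f GB/s to keep the GPUs fed\n",
                data_bytes / compute_s / 1e9);
}
```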


Synchrotron X-Ray Radiation

From Chilingarian et al., IEEE Trans. Nucl. Sci. 58(4), 2011 (p. 1454):

Fig. 9. Reconstruction of tomographic images requires reading and writing large amounts of image data. The left chart shows the joint read/write throughput for different storage systems: a WDC 5000AACS SATA hard drive is compared with an Intel X25-E SSD, two such SSDs organized in a striping RAID-0, and a virtual RAM disk (Ext4 file system in all cases). The right chart shows the ratio of time spent in computation and in I/O, measured on the GPU server with a storage system of two Intel X25-E SSDs in RAID-0.

Conclusion (excerpt): modern graphics cards provide over 1 TFlop/s of computational power and can speed up scientific computations by more than an order of magnitude. The filtered backprojection algorithm used for tomographic reconstruction can be implemented efficiently and with good scalability on the GPU architecture; based on the source code of PyHST, a GPU implementation was developed that exploits multiple GPU and CPU devices simultaneously. Using a GPU server equipped with 4 graphics cards, it takes only 40 seconds to reconstruct a 3 gigavoxel image from 2000 projections, approximately 30 times faster than an 8-core Xeon server of the same price, and even a $1000 desktop with a single GPU card outperforms the Xeon server by a factor of four. The remaining challenge is loading multiple gigabytes of source data into system memory: even the fastest SSDs assembled as RAID-0 need about 2 minutes for all disk operations, so the authors are working on reading images directly into system memory, bypassing the hard drive. Of the NVIDIA hardware compared, the professional Tesla series and the newly presented Fermi architecture do not improve the tomographic reconstruction significantly; the GeForce GTX 295 with two GT200 processors is the fastest adapter, up to 4 such cards can be stacked in one system (e.g. Supermicro's 7046GT GPU servers with up to 192 GB DDR3), and a cheaper desktop based on an Asus Rampage III Extreme motherboard with two GTX 295 cards reaches 1 TFlop/s for under $2000. For the future, the pipelined architecture is planned to be extended from 2D slice-by-slice reconstruction to 3D volume reconstruction, as needed e.g. for synchrotron laminography.

Chilingarian et al., IEEE Trans. Nucl. Sci. 58(4), p. 1447, 2011

Current proposals: steps beyond
• Mostly GPU applications
• Reconstruction of tomographies at the European Synchrotron Radiation Facility
• Integration of GPUs into the ITMS framework of data acquisition in fusion experiments
• Track reconstruction simulation for PANDA at FAIR
• Trigger of NA62
• RICH reconstruction for ALICE


Parallel Opportunities
• On detector (mostly CA)
  – trackers
  – calorimeters
• Trigger Systems
  – event decision: one event per node + dispatcher
• The GRID
• Opportunity Awareness
• CERN openlab


DAQ for fusion experiments
• Intelligent Test and Measurement System (ITMS) platform
• Framework of DAQ for fusion experiments
• Self-adapting sampling rate according to the input data rate (see the sketch below)
• Optimise data volume without loss of precision
• GPU performing the adaptive algorithm
• Bandwidth detection based on FFT
• Server box with 2 CPUs and 1 GPU (NVIDIA Tesla)
• Compared to direct LabVIEW acquisition
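A minimal sketch of the adaptive-sampling idea: estimate the highest significant frequency from a magnitude spectrum (which the ITMS GPU stage would obtain with an FFT) and set the sampling rate just above Nyquist for it. The threshold, the safety margin and the toy spectrum are invented for illustration; this is not the ITMS algorithm itself.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Given a magnitude spectrum (assumed to come from the GPU FFT stage), find
// the highest bin that is significantly above the noise floor.
double highest_active_freq(const std::vector<double>& magnitude,
                           double bin_width_hz, double noise_floor) {
    double highest_hz = 0.0;
    for (std::size_t k = 0; k < magnitude.size(); ++k) {
        if (magnitude[k] > 10.0 * noise_floor)  // 10x threshold: an assumption
            highest_hz = static_cast<double>(k) * bin_width_hz;
    }
    return highest_hz;
}

int main() {
    // Toy spectrum: signal energy up to bin 40 of 128, noise elsewhere.
    std::vector<double> mag(128, 0.01);
    for (std::size_t k = 1; k <= 40; ++k) mag[k] = 1.0;

    const double bin_width_hz = 1000.0;          // 1 kHz per bin (assumption)
    double f_max = highest_active_freq(mag, bin_width_hz, 0.01);
    double new_rate = 2.5 * f_max;               // Nyquist plus a 25% safety margin
    std::printf("detected bandwidth ~%.0f kHz -> new sampling rate %.0f kHz\n",
                f_max / 1000.0, new_rate / 1000.0);
}
```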


PANDA detector simulation

Al-Turany et al., J. Phys.: Conf. Ser. 219 (2010) 042001

Tomography reconstruction at the ESRF


From Chilingarian et al., IEEE Trans. Nucl. Sci. 58(4), 2011 (p. 1452):

Fig. 3. The 2000 projections (left) were used to reconstruct a 3 gigavoxel image of porous polyethylene grains (right). The computational complexity of the reconstruction is about 54 TFlop; a total of 35 GB of data must be read and stored.

Fig. 4. Performance evaluation of C compilers (left) and FFT libraries (right), run on a desktop system with the single-CPU version of PyHST. Only the backprojection step was measured in the compiler benchmark and only the filtering step in the FFT benchmark. SSE vectorization was switched on for all compilers (gcc and clang: -O3 -march=nocona -mfpmath=sse; Intel C Compiler: -O3 -xS; FFTW3 compiled with SSE support). Performance is measured in GFlop/s; larger values are better.

Fig. 5. Time needed to reconstruct the sample dataset on different hardware platforms. The CPU-based reconstruction was performed on the Xeon server; only GPUs were used on all other platforms. Shorter times are better.

Fig. 6. Scalability of the GPU-based reconstruction on an NVIDIA Tesla S1070 system with 1 to 4 Tesla C1060 GPUs, measuring the complete reconstruction including host-GPU data transfer. The implementation scales linearly, losing only 2.6% of the maximum possible performance with all 4 GPUs enabled.

Fig. 7. MFlop/s-per-dollar efficiency of the tested hardware platforms for reconstruction using the backprojection algorithm (filtering omitted). Higher values are better; the actual GFlop/s figures and platform prices are shown in the chart.

Reconstruction performance (excerpt): the filtering and all sources are compiled with gcc-4.4 using the -O3 -march=nocona -mfpmath=sse optimization flags. If the GPU is used for image reconstruction, even a cheap desktop is approximately 4 times faster than an expensive Xeon server; the GPU server equipped with four graphics cards reconstructs a 3 gigavoxel image from 2000 projections in only 40 seconds, i.e. 30 times faster at a comparable price. The implementation scales well: only 2.6% of the maximum possible performance is lost when the Tesla system is scaled up from one to four GPUs. In MFlop/s per dollar the desktop products are superior to the server-based solutions, and owing to the good scalability the Advanced Desktop may be further enhanced by adding two more graphics cards. Among the NVIDIA cards assessed, the GeForce GTX 295 includes two GT200 processors, the GTX 280 is its single-processor counterpart, the Tesla C1060 is a professional GT200 solution with more memory than the consumer products but reduced clock rates, and the GTX 480 is the current flagship based on the latest GF100 (Fermi) architecture.

Scalability of the GPU implementation
(platform prices from the chart: $1000, $1800, $5500, $8000, $12000)

GPU server is 30 times faster than the Xeon server

PANDA detector simulation
• FAIR facility
• PANDA detector simulation: tracker
• Testing several configurations


Tomography reconstruction at the ESRF
• 5 computer configurations tested
  – Xeon Server
  – Simple Desktop with 1 GPU
  – Advanced Desktop with 2 GPUs
  – NVIDIA Tesla Server with 4 GPUs
  – Supermicro GPU Server with 4 GPUs


DAQ for fusion experiments


Nieto et al., IEEE Trans. Nucl. Sci. 58(4), p. 1714, 2011

The NA62 Experiment @ CERN
• Looks at the ultra-rare decay K → π ν ν̄ (about 1 event in 10^10)
• 3 levels of trigger: 1 hardware, 2 software
