Proceedings of the IRUN Workshop on the Theory and Practice of Kiloprocessor Architectures

FACULTY OF INFORMATION TECHNOLOGY PÁZMÁNY PÉTER CATHOLIC UNIVERSITY

BUDAPEST, HUNGARY

2012

Conference Proceedings

BUDAPEST, HUNGARY November 5-7, 2012

Pázmány University ePress

Budapest, 2012

© Pázmány Péter Catholic University, Faculty of Information Technology, 2012

Published by Pázmány University ePress, 2012

Budapest

Responsible publisher:

Rev. Dr. Szabolcs Anzelm Szuromi O. Praem.

Rector of the Pázmány Péter Catholic University

Prepared within the framework of project TÁMOP-4.2.2/B-10/1-2010-0014, with the support of the New Széchenyi Plan

ISBN 978-963-277-041-3

Organizers

Pázmány Péter Catholic University, Faculty of Information Technology

IRUN: International Research Universities Network

IEEE Hungary Section

Sponsor

European Union

Promotion of Excellence at the Pázmány Péter Catholic University, TÁMOP-4.2.2/B-10/1-2010-0014

Scientific Committee

Prof. Tamás Roska, Member of the Hungarian Academy of Sciences, Head of the Jedlik Laboratory, PPCU

Prof. Ákos Zarándy, DSc, Computer and Automation Research Institute of the Hungarian Academy of Sciences, Research Advisor

Prof. Péter Arató, Member of the Hungarian Academy of Sciences, Budapest University of Technology and Economics (BME)

Prof. Ferenc Friedler, DSc, Department of Computer Science and Systems Technology, University of Pannonia

Organizing Committee

Prof. Péter Szolgay

Marianna Szalay

Contact: irunbud2012@itk.ppke.hu

Physical and Virtual Cellular Machines for Nanoscale Architectures

Tamás ROSKA

Pázmány University, the University of Notre Dame, and the Hungarian Academy of Sciences, Budapest, Hungary

International Research University Network (IRUN) Workshop, Budapest, November 5-7, 2012

What is the fundamentally new architectural concept in nanoscale? The topographic address of a processor and memory.

What is the fundamentally new software platform in nanoscale? The application specific Virtual (multi) Cellular Machine mapped onto a Physical Machine.

What is the fundamentally new elementary instruction in nanoscale? The spatial-temporal non-Boolean wave on a cellular processor and memory array.

SpiNNaker: 1 million ARM9 Cellular Machine

SpiNNaker: Spiking Neural Network Architecture
• 130 nm CMOS technology
• 18 cores per chip, 48 chips per board
• 57,600 custom chips
• 50-100 kW (1/100 of the general purpose supercomputer, 100 times bigger than an analog CMOS)
• IEEE Spectrum, August 2012, pp. 40-43

2. FPGAs – three cell arrays: logic and arithmetic processors, and memories

XILINX VIRTEX 6 FPGA
• Over 70k Advanced Silicon Modular Logic Blocks in a cellular array
• Over 2k DSP "slices" in a cellular array
• 36 kB memory blocks in a cellular array
• Application specific array components
• Logic and arithmetic processor arrays and memory arrays are tiled on a 2D plate
• Local directional preference (stream)

Theory and Practice of Kiloprocessor Architectures, IRUN Workshop, 2012 … why?

Fermi’s 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).

Physical Parameters of Cellular Machines

The five key physical parameters of the building blocks and the systems:
• Speed
• Power dissipation/energy
• Area/size
• Bandwidth/latency
• Accuracy

[Figure: PCM: FPGA (e.g. XILINX or Altera). VCMa: Navier-Stokes PDE. VCMb: UMF algorithm. VCM: two types of FALCON implementation (VCM1, VCM2).]

Virtual and Physical Cellular Machines
• The archetype: the virtual memory (hiding memory), the precedence of data locality
• Now the basis: in the physical many-core chips, the precedence of processor and memory locality
• Cellular: the precedence of geometrical locality – smaller communication distance in the physical machines

The design challenge: morphisms between three domains

The three domains:
• data/physical objects in the problem, e.g. pixels, taxels, atoms, grammar units, subgraphs
• physical processing cores/cells and memories within a chip, and
• subroutines and algorithms on topographic processor/memory arrays for topographic and non-topographic problems

Virtual Cellular Machines
• Building blocks have no physical constraints (size, bandwidth, speed, power dissipation), except the two classes of memory access (local and global)
• Description in C, OpenCL, etc., and simulators
• Many Virtual Machines to one Physical Machine

The role of application specific compilers

Three different engineering levels
• The Physical level
• The prototype Virtual level architecture (that might be application specific), that should be mapped to a Physical level
• The Algorithmic/software level, operating on the prototype/application specific Virtual machines

Algorithmic principles
• Computing via synchronization of oscillatory cells – O-CNN
• Sparse 1D representation of 2D images and maps (T. Shibata et al.)
• First results on frameless detection of spatial-temporal actions (dynamic patterns)

Physical implementation

[Figure: optical input and preprocessing; feature vector generation; CMOS to non-CMOS conversion; T1, T2, T3 and T4 OCNN arrays; non-CMOS to CMOS conversion; winner localizer algorithm; contact pads (CMOS).]

A fundamental challenge: frameless detection

Detect a spatial-temporal event, an action, without frame by frame analysis
• The action itself, or the atomic spatial-temporal features, are detected as the onset of a characteristic locally computable property

ECCTD 2005, T. Roska and Á. Zarándy

Proactive Adaptive Cellular Engine (PACE)

[Figure: aCCN-UM with auxiliary sensors; output/activation: y; light intensity and/or pattern control; light beam: gij(t); target environment.]

The Design Scenario

[Figure: Task/Problem/Workload; Algorithms on Data/object topography; Mapping Algorithms, Equivalent Transformations; Operation on Processor/memory topography; Physical Implementation.]

Virtual Cellular Machine, Physical Cellular Machine: COMPILER?

Architectural constructs
• The introduction of dynamic template elements [H.M.D. Ip, E. Drakakis, et al.]
• The introduction of Oscillatory Cells [A. Horváth, F. Corinto, T. Roska et al.]
• The introduction of lighting patterns [M. Koller and Gy. Cserey] in a Proactive Adaptive CNN Architecture [T. Roska and Á. Zarándy]

Associative Memory via O-CNN and multiple O-CNN

Non-Boolean Architectures and Devices – an Intel (University Research Office) supported basic research project (G. Bourianoff et al.)

A key instruction via synchronization: frequency shift or phase input vector generates an output phase signature vector after synchronization

Algorithmic architectures

Recognition of atomic spatial-temporal motifs without frames
• Standard CNN + delay
• Space-variant CNN: local adaptation
• Boundary controlled CNN
• Oscillatory CNN (O-CNN)
• O-CNN plus boundary control
• Spatial-temporal actuation control (lighting, pressure, audio, radiation, etc.) with and without feedback

Z. Néda et al.

STAR CNN: Itoh-Chua, Hoppensteadt-Izhikevich, Corinto-Bonnin-Gilli

Frameless detection of spatial-temporal actions (dynamic patterns)

Semantic embedding

Developing spatial-temporal semantics
• Application specific situation classes
• Characteristic spatial-temporal events
• Spatial-temporal relations defining prototype events
• Composing related databases by real life data acquisition

Example: Bionic Eyeglass

O-CNN – oscillatory CNN

[Figure: O-CNN AM clusters. An input vector is mapped to an output (signature) vector. General cellular model with coupling weights w1,1 … wm,n; simple 1D cellular model with weights w1,1 … w1,n.]

The functionality depends on the coupling weights of the OCNN network, controlled via the intercell distances. The separability/classification accuracy is improved by tuning the weights.

Signature recognition

Data reduction techniques
• Detecting the motifs in 2D via multi-channel processing
• Compression and mapping into 1D feature vectors
• Mapping 1D feature vectors into multiple signatures via associative memory clusters
• Matching signatures with prototype ones with different metrics (Euclidean, Hamming, Hausdorff, and autowave type)

High-speed optical control systems in industrial manufacturing processes based on Cellular Neural Networks

Ronald Tetzlaff

Chair of Fundamentals of Electrical Engineering

Institute of Fundamentals of Electrical Engineering, Faculty of Fundamentals of Electrical Engineering, Technische Universität Dresden

IRUN Workshop on the Theory and Practice of Kiloprocessor Architectures, 5th–7th November 2012

High-speed optical systems are important in several manufacturing processes, e.g. for the inspection of metal surfaces in wire drawing.

[Figure: surface defect types: chatter marks, large scratches, small scratches, cross dents.]

Microelectronics in Saxony

Drawing die

Production line with a chain of drawing dies
• Feeding rate: 10 m/s
• Final wire diameter: 3.3 mm
• Surface defects cause material defects

Interaction between two magnets – the magnetic template element

The template element depends on the dipole strength, on the strength B of the external (perpendicular) magnetic field, on the distance r_jk between the two magnets, and on the unit vector parallel to the line connecting the two magnets.

• 1828: The Technical School is founded in Dresden and the Royal Saxon Commerce Deputation takes over the administration.
• 1871: In recognition of the high level of the education offered in technology and the natural sciences, the school is renamed the Royal Saxon Polytechnical College. 281 students are enrolled at the college (more than 60% belong to the Engineering Department).
• 1961: The Technical College of Dresden receives the status of Technical University.
• 2011: The campus family of TU Dresden comprises more than 36,000 students, 5,319 publicly funded staff members (among them 507 professors) and approximately 3,265 externally funded staff members.

TU Dresden is among the top universities in Germany. As a modern full-status university with 14 departments it offers a wide academic range and is the largest university in Saxony.

Technische Universität Dresden

• Problem
• Control of Laser Beam Welding (LBW) processes
• Wire drawing processes
• Conclusion

Intelligent cameras production process surveillance feature extraction …

sensing + processing reaction

Cellular Neural Networks (CNN)

A CNN is defined by:

1) A spatially discrete collection of continuous nonlinear dynamical systems called cells, where information can be encrypted into each cell via three independent variables called input, threshold, and initial state.
2) A coupling law relating one or more relevant variables of each cell to all neighbor cells located within a prescribed sphere of influence of radius r.

[Chua, 1998]

24 years later …

L. Chua and L. Yang, 1988: Standard CNN

CNN based vision systems have gained maturity …

High-speed optical systems are important in several manufacturing processes, e.g. for the control of laser welding processes.

Zinc-coated steel sheets in overlap joint: smoke residue, spatters

Vision systems

[Figure: conventional system: image data is passed to a CPU, which produces the process control signal. Parallel system: a Cellular Neural Network (an array of processing cells) performs the pre-processing and a CPU produces the process control signal.]

Phenomena and dynamics of CNN
• Complexity: pattern formation
• Nonlinear wave propagation: solitary waves, solitons
• Chaos
• Synchronization by modeling bio-inspired systems

Cellular Wave Computer

CNN endowed with local analog memory, local logic memory, some control and communication circuitry and with optical sensors.
• Supercomputing chip for combined sensing and processing
• Tera operations at a die size of 1 cm²
• More than 10 000 fr/s (up to 50 000 fr/s)

High-speed optical systems are important in several manufacturing processes, e.g. for the inspection of metal surfaces in

Ball bearings:
• Edge-size variation
• Cracks

Valve plates:
• Cracks
• Disruptions

Integrated sensing and processing by intelligent cameras structure ? standard technology parallel, direct processing of sensor values, bio-inspired, adaptive, robust

[Figure: CNN cell (i,j) with input u_i,j, state x_i,j, output y_i,j, bias z_i,j and synaptic coupling S_i,j. The state equation combines the network nonlinear operation H, the synaptic coupling and the bias (example).]

State equation of a CNN
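For orientation, the following is a minimal sketch of the standard Chua-Yang CNN state equation, dx_ij/dt = -x_ij + sum(A * y) + sum(B * u) + z, with the piecewise-linear output y = 0.5 * (|x + 1| - |x - 1|). It is not the equation shown on the slide, whose exact form did not survive. The forward-Euler integration, the zero boundary condition, the step size and the example templates are all assumptions chosen for illustration.

```python
def output(x):
    """Standard piecewise-linear CNN output: y = 0.5 * (|x + 1| - |x - 1|)."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def cnn_step(x, u, A, B, z, dt):
    """One forward-Euler step of dx/dt = -x + sum(A*y) + sum(B*u) + z.

    A and B are 3x3 templates (neighbourhood radius r = 1). Cells outside
    the grid contribute nothing (an assumed zero boundary condition).
    """
    rows, cols = len(x), len(x[0])
    y = [[output(v) for v in row] for row in x]
    new_x = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = -x[i][j] + z
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    k, l = i + di, j + dj
                    if 0 <= k < rows and 0 <= l < cols:
                        acc += A[di + 1][dj + 1] * y[k][l]
                        acc += B[di + 1][dj + 1] * u[k][l]
            new_x[i][j] = x[i][j] + dt * acc
    return new_x


def run_cnn(u, A, B, z, dt=0.05, steps=400):
    """Integrate from a zero initial state and return the output image."""
    x = [[0.0] * len(u[0]) for _ in u]
    for _ in range(steps):
        x = cnn_step(x, u, A, B, z, dt)
    return [[output(v) for v in row] for row in x]


# Hypothetical example templates: self-feedback 2, direct input weight 1.
A = [[0, 0, 0], [0, 2, 0], [0, 0, 0]]
B = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
result = run_cnn([[1.0, -1.0], [-1.0, 1.0]], A, B, z=0.0)
```

With these uncoupled example templates each cell simply saturates to the sign of its own input; a template with non-zero off-centre entries would add the neighbour coupling that gives a CNN its image-processing behaviour.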

24 years later …

L. Chua and L. Yang, 1988: Standard CNN

CNN based vision systems have gained maturity, but only relatively simple forms of the state equation have been realized on vision systems (low resolution) up to now.

Fastest Conventional Systems/Line Cameras

Wire drawing quality control: Zumbach SIMAC 63 (2010), 3 line cameras, < 35,000 lines per second

CoaXPress line camera: e2v ELIIXA+ (2011), line camera, < 100,000 lines per second

Results

[Figure: reference images of chatter marks, large scratches, small scratches and cross dents.]

Defect type: reference images, detection rate (%)
• Cross dents: 10, 100
• Chatter marks: 45, 100
• Large scratches: 91, 100
• Small scratches: 129, 97

Frame rate f (kHz); no defect: 226010 7.64.1

Digital CNN Architectures

− Falcon (Szolgay, Nagy), …
− Recent design (Müller, Tetzlaff)
− Retaining the CNN paradigm: cellular structure, local connections

Requirements for Surface Inspection System

Critical defect size: 100 µm in drawing direction

Sampling theorem: 10 m/s → 200,000 line images per second
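The line rate quoted above follows from sampling each critical defect at least twice along the drawing direction. A small sketch of that arithmetic, where the function name and the `samples_per_defect` parameter are illustrative choices and not part of the original slides:

```python
def required_line_rate(feed_rate_m_s, defect_size_m, samples_per_defect=2):
    """Line images per second needed to sample a defect of the given length
    `samples_per_defect` times while the wire moves at `feed_rate_m_s`."""
    sampling_pitch_m = defect_size_m / samples_per_defect
    return feed_rate_m_s / sampling_pitch_m


# 100 µm critical defect size at a feeding rate of 10 m/s
line_rate = required_line_rate(10.0, 100e-6)
print(f"{line_rate:,.0f} line images per second")
```

At 10 m/s the wire advances 50 µm between two consecutive lines, which is half of the 100 µm critical defect size.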

[Figure: cross dent, 3-D measurement; drawing process, FEM model.]

Testbed for Wire Drawing

Pulsed LED illumination: 10 µs, 20 W/cm²

EyeRIS 1.2 (Q-Eye)

CNN chip (producer); year; technology, pixel size; resolution; cameras: producer, type (additional DSP)

• ACE16k; 2004; 350 nm, 73x75 µm²; 128x128 pixels; AnaFocus: EyeRIS 1.1 (DSP), Analogic: Bi-i (DSP)
• Q-Eye; 2008-2011; 180 nm, 33.6 x 33.6 µm²; 176x144 pixels; AnaFocus: EyeRIS 1.2 (DSP), EyeRIS 1.3 (DSP), EyeRIS VSoC
• SCAMP3, ASPA-3, SCAMP5 (P. Dudek, University of Manchester); 2005-2011, 2012; 350 nm, 180 nm, 180 nm; 128x128 pixels, 180x80 pixels, 256x256 cell array
• MIPA4k (Poikonen, Laiho, University of Turku, Finland); 2009; 130 nm; 64x64 cell array
• VISCUBE (EyeRIS VSoC); 2012; 3-D integrated chip: sensor (320x240 pixels), 2 processor layers with 160x120 and 8x8 cell arrays

CNN based realizations

There is a growing interest in high-speed CNN based vision systems for manufacturing processes.

Wish list:
• higher resolution, sensitive sensors, adaptivity (learning)
• more complex state equations (wave propagation)

Setup for Gapless Surface Inspection

[Figure: wire with illumination 1 and illumination 2 and a CNN camera; axes x, y, z; focal depth Δz; angles α and β, α = 45; dimensions 5 mm and 4 mm.]

Feeding rate: 10 m/s

Field of view: 5 mm, feeding rate 10 m/s → min. frame rate of 2 kHz

Safe detection → 50 % overlap → min. frame rate of 4 kHz
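The two frame rates above come from requiring that the wire never advances further than the non-overlapping part of the field of view between two frames. A short sketch of that calculation, with the function name and argument names chosen here for illustration:

```python
def min_frame_rate(feed_rate_m_s, field_of_view_m, overlap=0.0):
    """Minimum frames per second so that consecutive frames share the
    given fraction of the field of view (0.0 means frames just touch)."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    advance_per_frame_m = field_of_view_m * (1.0 - overlap)
    return feed_rate_m_s / advance_per_frame_m


gapless_hz = min_frame_rate(10.0, 5e-3)            # frames just touch
safe_hz = min_frame_rate(10.0, 5e-3, overlap=0.5)  # 50 % overlap
print(gapless_hz, safe_hz)
```

With 50 % overlap every point of the wire surface appears in two consecutive frames, which is why the required frame rate doubles.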

Laser Beam Welding (LBW)

• Zinc-coated steel sheets are composed of two metals with different fusion and boiling points.
• As soon as the beam hits the material surface, the melting of the solid materials starts and a capillary is generated.
• The capillary is surrounded:
  • On its front side by a liquid steel layer.
  • On its rear side by the melt pool (T_melting Fe = 1809 K).
  • In the sheet interface by zinc vapours (T_vapour Zn = 1180 K).

[Figure: substrate (overlap joint); incident laser light; reflected laser light; liquid-vapor.]

• Conventional uncontrolled processes
  − Predefined profile of laser power or feeding rate.
  − Possible imperfections, e.g. craters, holes, smoke residue and spatters.
  − Do not allow easily changing welding parameters, e.g. process speed and material thickness.
• Due to the high dynamics of LBW, a robust feedback system requires control rates in the multi-kHz range.
• No visual real-time control has been developed yet, since conventional cameras are not sufficient to reach the required control rates.

[Figure: workpiece (overlap joint) and laser beam.]

Uncontrolled full penetration weld of 2x0.7 mm zinc-coated steel sheets in overlap-joint configuration with 0.1 mm gap, at 9 m/min and constant laser power of 5.5 kW with 10% power as factor of safety.

Laser Beam Welding (LBW)

THE FULL PENETRATION HOLE (FPH)

The FPH detection is used as feedback to control the laser power!

Image features

THE PROJECT WAS SPONSORED BY THE "LANDESSTIFTUNG BADEN-WÜRTTEMBERG"

Control of Laser Beam Welding (LBW) processes

Control of laser welding processes

Latency period < 1 ms

Feature extraction: 5-10 kHz

[Figure: image sequences at 0.00 ms, 0.33 ms, 0.67 ms and 1.00 ms, and at 0.0 ms and 4.1 ms.]

Visual closed-loop system

• PLC: Programmable Logic Circuit.
• SCU: System control unit.
• Eye-RIS vision system (VS).
• Laser machine.

[Figure: full penetration hole (FPH); thermal radiation of the melt in the IR range.]

Sensor for Production line

[Figure: sensor head; 5 m/s; 2 m/s.]

Laser Beam Welding (LBW)

• Due to the hydrostatic pressure of the metal vapor, all the plates of the welding setup are penetrated, creating the so-called "full penetration".
• In the coaxial camera image, it is visible as a dark area directly behind the laser interaction zone, the so-called full penetration hole.
• The full penetration is an important quality feature which guarantees the strength of the material connection.

A Cellular Neural Networks (CNN) based camera

• The AnaFocus Eye-RIS VS allows reaching control rates above 10,000 images/s.
  − Grey-scale images: pixel values range within 0 and 254.
  − Binary images: pixel values are 1 or 0 if they are white or black, respectively.
• The Q-Eye can be programmed by the specification of typical CNN operators.
• CNN are grids of MxN cells where each cell can interact:
  − Directly with its neighboring cells.
  − Indirectly with the other cells because of propagation effects of the network.
  − Its dynamics is defined by a state equation.
• CNN are particularly suitable for image processing tasks.

[Figure: Q-Eye, FPGA, I/O.]

SPATTER DETECTION

• Spatters often appear overlapped with the vapour plume, which has a fluctuating intensity. Solution: local threshold.
• The execution time of a single image processing step is about 50 µs.
• The mask shape limits this algorithm within the orientation-dependent strategies.

• The following experiments will concern welds of two zinc-coated steel sheets in an overlap joint configuration, executed also under variable welding conditions.
• At each control step i, a new sensing phase is performed while the image acquired at the previous time step i-1 is being evaluated. If the FPH is found in the image, the laser power is decreased, and increased otherwise.
• Typical control rates, including image sensing, evaluation, and control tasks, are:
  − Dilation algorithm: 14 kHz.
  − Omnidirectional algorithm: 6 kHz.
  − Algorithm for spatter detection: 15 kHz.
  − Multi-feature algorithm: 8 kHz.
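The control rule described above (decrease the laser power when the FPH is detected, increase it otherwise) can be sketched as a simple bang-bang update. The step size, the power limits and the starting power below are hypothetical values chosen for illustration; the slides do not state them, and the sketch does not model the pipelining of sensing and evaluation.

```python
def control_step(power_w, fph_detected, step_w=50.0, min_w=0.0, max_w=6000.0):
    """Return the next laser power: lower it if the full penetration hole
    (FPH) was found in the evaluated image, raise it otherwise.

    step_w, min_w and max_w are assumed values, not taken from the source.
    """
    delta = -step_w if fph_detected else step_w
    return max(min_w, min(max_w, power_w + delta))


def run_control(detections, start_w=5000.0, **kwargs):
    """Apply control_step to a sequence of per-frame FPH detection results."""
    power = start_w
    history = []
    for fph_detected in detections:
        power = control_step(power, fph_detected, **kwargs)
        history.append(power)
    return history


# FPH seen in the first two frames, lost in the third
print(run_control([True, True, False]))
```

Because the rule only needs one binary feature per frame, the achievable control rate is set by the image sensing and feature extraction time rather than by the controller itself.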

Experimental results

VARIABLE PROCESS ORIENTATION

• Controlled full-penetration weld by the omnidirectional algorithm:
  - Thickness: 2x0.7 mm.
  - Gap: 0.1 mm.
  - Speed in the straight paths: constant.
  - Curved shapes were performed at lower speeds.

Experimental results

CNN based visual algorithms

COMBINED DETECTION OF FPH AND SPATTERS

Multi-feature algorithm:
• Laser power control by FPH extraction and quality monitoring by spatter detection.
• Single image processing time of approximately 95 µs.

CNN based visual algorithms

FULL PENETRATION HOLE DETECTION: simulation results

• Only a few false detections of the same type occur consecutively.
• Error rates of about 18% do not represent a threat for the process control.

VARIABLE MATERIAL THICKNESS

• Thickness reductions are followed by a decrease of the laser power, which avoided holes in the workpiece.
• Transitions from thinner to thicker material led to an increase of the laser power, which prevented penetration losses.
• Controlled full-penetration weld:
  - Gap: 0.1 mm.
  - Process speed: 4 m/min.

Experimental results

SPATTERS

Image features

MASK BUILDER

[Figure: mask builder for the FPH detection; mask builder for spatter detection.]

VARIABLE PROCESS SPEED

• Controlled full-penetration weld:
  - Thickness: 2x0.7 mm.
  - Gap: 0.1 mm.
  - Process speed: on the left from 3 to 9 m/min through 5 and 7 m/min. On the right the speed profile was reversed.

• Controlled partial-penetration weld:
  - Thickness: 2x1.0 mm.
  - Gap: 0.2 mm.
  - Process speed: 7-6-7-6 m/min.

Experimental results

Berthold-Leibinger-Innovation Award 2012: 3rd place

Thanks to

Dr. L. Nicolosi, Institut für Grundlagen der Elektrotechnik und Informationstechnik, Technische Universität Dresden

Dr. A. Blug, Fraunhofer Institut für physikalische Messtechnik IPM, Freiburg

F. Abt, IFSW Institut für Strahlwerkzeuge, Universität Stuttgart

Conclusion

• CNN based vision systems have gained maturity for control of industrial manufacturing processes.
• There is a growing interest from industry. The realization of new programmable systems (higher resolution) implementing more complex CNN state equations would be beneficial.
• Implementation of CNN with nonlinear weight functions.
• Further applications are possible in the automotive industry. Therefore reliable (robust) systems should be available in mass production, leading to low system prices.

Partial Penetration

[Figure: uncontrolled welding: defect in the bottom layer; CNN based controlled welding. Scale bars: 10 mm, 1 mm.]

Conclusion

1) Wire drawing: inspection of rapidly moving surfaces, allowing the detection of small defects:
  • 50 µm at 5 m/s
  • 100 µm at 10 m/s
  • 1 mm at 180 km/h
2) Laser welding: control rates between 6 and 15 kHz were reached. An algorithm for the combined detection of spatter and FPH was also developed, leading to control rates up to 8 kHz.

“Although nowadays process monitoring systems are suitable for various laser applications, a process control system to prevent weld seam defects online is still to come and a desire of likely all users” [Schmidt et al]

The most dangerous of all falsehoods is a slightly distorted truth. Georg Christoph Lichtenberg

Automatic generation of locally controlled arithmetic unit via floorplan based partitioning

(Csaba Nemes)

Agenda

• Introduction
  – Why partitioning?
  – Global vs local control
• Objectives
• Algorithm: new representation
• Algorithm: floorplan based partitioning
• Implementation results
• Summary and future works

Introduction

• Cellular Neural Networks (CNN)
  – FPGA implementation: Falcon architecture
  – Partial differential equation (state equation) is discretized in space and time
  – Mathematical expression is evaluated for each vertex of a regular grid
• Further possible use cases: other partial differential equations
  – Computational Fluid Dynamics (CFD)
  – Computational Electro Magnetics (CEM)
  – The Falcon memory structure can be reused, only the arithmetic unit has to be re-designed.

The block diagram of the complete architecture

A typical scenario

Global or local control?

Drawbacks of common partitioning algorithms

• Placement
• Deadlock
• Pipeline length

Partitioning and placement objectives

• Minimize the number of cut edges and constrain the number of IO connections of each cluster
• Minimize the length of the longest interconnection between the clusters
• Avoid deadlock
• Pipeline length

Algorithm: floorplan

Vertices are placed via simulated annealing guided by the linear combination of the following objective functions:
(1) Total squared distance (TSD) of the connected vertices
(2) Maximum distance (MD) between vertices which have a common input
(3) Maximum interconnection length (MIL)

Partitioned data-flow graph

Preprocessing and layering

Layering: the data-flow graph is converted to a special bipartite graph. In this bipartite graph every vertex is associated with a vertical level and each arc points directly to the next level. It can be generated via a breadth-first search in linear time by splitting up the arcs which span more than one level with extra delay vertices.

Benefits: decreased complexity, deadlock avoidance, horizontal cutting
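The layering step can be sketched as follows. This is an illustrative reading of the description, not the author's implementation: it assumes the data-flow graph is acyclic, assigns each vertex its longest-path level with a Kahn-style breadth-first traversal, and inserts hypothetical `delay_...` vertices on arcs that span more than one level.

```python
from collections import deque


def layer_graph(succ):
    """Assign levels to a DAG and split long arcs with delay vertices.

    succ maps each vertex name to a list of successor names. Returns
    (level, arcs); every arc in `arcs` goes from level k to level k + 1.
    """
    vertices = set(succ) | {w for ws in succ.values() for w in ws}
    indegree = {v: 0 for v in vertices}
    for ws in succ.values():
        for w in ws:
            indegree[w] += 1

    level = {v: 0 for v in vertices}
    queue = deque(sorted(v for v in vertices if indegree[v] == 0))
    while queue:  # breadth-first traversal in topological order
        v = queue.popleft()
        for w in succ.get(v, []):
            level[w] = max(level[w], level[v] + 1)
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)

    arcs = []
    for v in sorted(succ):
        for w in succ[v]:
            prev = v
            # One delay vertex per skipped level (naming is hypothetical).
            for k in range(level[v] + 1, level[w]):
                delay = f"delay_{v}_{w}_{k}"
                level[delay] = k
                arcs.append((prev, delay))
                prev = delay
            arcs.append((prev, w))
    return level, arcs


level, arcs = layer_graph({"a": ["b", "c"], "b": ["c"], "c": []})
```

In the small example the arc from `a` to `c` spans two levels, so it is replaced by two arcs through one delay vertex on level 1. Each vertex and arc is visited a constant number of times, which matches the linear-time claim apart from the inserted delay vertices.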

Algorithm: partitioning

Partitioning is done with simulated annealing. The floorplanned graph is divided into belts. In each iteration the spin of one vertex of a belt is perturbed, and the belt is recolored. The energy function is the linear combination of the following objective functions:
• Number of cut arcs
• Maximum number of cut arcs which belong to the same cluster (maximum cluster IO)
• Number of clusters
• Number of connections between non-neighboring clusters which are on the same belt
• Number of mutual dependencies between neighboring clusters which are on the same belt

Summary

• The framework generates local control for the input mathematical expression (VHDL, ready to use with Xilinx ISE).
• Novel floorplan based partitioning is proposed to determine which FPUs shall be controlled together to increase the operating frequency of the AU.
• Novel representation is proposed to represent a partition of placed vertices of the floorplanned graph.
• The proposed algorithm is evaluated in case of two mathematical expressions related to CFD, and a 15-25 % speed-up has been reached.
• Further mathematical expressions shall be investigated.

Proposed algorithm

1. Place the vertices of the data-flow graph in the plane to minimize the distance between the connected vertices via Simulated Annealing. (Floorplan)
2. Partition the floorplanned vertices to minimize the previously described objectives via another Simulated Annealing. (Partitioning)
3. [Optional: the positions of the clusters can be used to generate Xilinx placement constraints for the circuit elements]

Algorithm: representation

To represent a partition of a floorplanned data-flow graph a novel representation is proposed.

Objective: Represent only valid partitions, in which clusters can be covered by continuous non-overlapping regions.

Main idea: Validity of a partition can be guaranteed if every vertex inherits its color from one of its neighbors or is painted with a unique color. As the vertices have uniform sizes, the direction of the inheritance can be described with a spin associated to every vertex.
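One way to read the spin representation is sketched below, under assumptions that go beyond the slides: the vertices sit on a regular grid, each spin is one of "U", "D", "L", "R" (inherit from that neighbour) or "N" (new colour), and clusters are decoded as the connected components of the inheritance links using a union-find structure. The spin encoding and the decoding method are illustrative choices, not the author's.

```python
SPIN_OFFSETS = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def decode_partition(spins):
    """Turn a grid of spins into a grid of cluster labels.

    A spin pointing off the grid is treated like "N" (an assumption).
    Cells linked by inheritance end up in the same cluster, so every
    cluster is a connected region of the grid by construction.
    """
    rows, cols = len(spins), len(spins[0])
    parent = {(i, j): (i, j) for i in range(rows) for j in range(cols)}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for i in range(rows):
        for j in range(cols):
            offset = SPIN_OFFSETS.get(spins[i][j])
            if offset is None:
                continue  # "N": the vertex starts its own colour
            k, l = i + offset[0], j + offset[1]
            if 0 <= k < rows and 0 <= l < cols:
                parent[find((i, j))] = find((k, l))

    labels = {}
    result = []
    for i in range(rows):
        row = []
        for j in range(cols):
            root = find((i, j))
            row.append(labels.setdefault(root, len(labels)))
        result.append(row)
    return result


clusters = decode_partition([["N", "L", "N"],
                             ["U", "N", "U"]])
```

Because colours only ever propagate between adjacent vertices, perturbing a single spin (as the partitioning step does) can never produce a cluster made of disconnected pieces, which is the validity property the representation is meant to guarantee.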

Implementation results

[Figure: Unstructured CFD. Operating frequency (MHz, speed) and occupied slices (area) versus the maximum I/O connection of a cluster. Frequency data labels in axis order (not, full, 8, 9, 10, 11, 12, 13, 14, 15): 272, 275, 314, 320, 326, 315, 320, 314, 318, 309.]

[Figure: Structured CFD. Operating frequency (MHz, speed) and occupied slices (area) versus the maximum I/O connection of a cluster. Frequency data labels in axis order (not, full, 8, 9, 10, 11, 12, 13): 252, 279, 311, 321, 321, 314, 301, 295.]

The reference case “not” indicates the unpartitioned case, where the data-flow graph is not partitioned. (The IO cut is the number of IOs of the whole data-flow graph.)

The reference case “full” indicates the fully partitioned case, where every floating-point unit forms a separate cluster. (The IO cut is the degree of the given vertex.)

Partitioned cases produce a 15-25% increase in operating frequency.

(20)

Soluti ons • Be tt er Mem or y In terf aces: Same FL OPS , be tt er PE uti liz ati on . • Decr eased oper ati ng fr eque ncy of pr ocessi ng el emen ts (PE): lo w er FL OPS (f req) , bu t be tt er uti liz ati on and FL OP /W at t(Gr een comp.) • Incr eased On -Chi p memory: Be tt er da ta reuse, low er FL OPS (c hip ar ea) wi th be tt er uti liz ati on . • Pr eop tim ized i nput and al gori thms: Same FL OPS , be tt er uti liz ati on. M esh Compu tations • Si mul ati ons o f ph ysi cal sy st ems: so und, hea t, el as tic ity , el ec tr odynami cs or flui d fl ow dynami cs • Da ta i nt ensi ve pr obl ems, hi gh co mpl exi ty • Irr egu lar me mor y acc ess + da ta r el oad = Lo w pr ocessor ut iliz atio n

ScramjetWindtunnel

Preoptim ization – Sin gle FPGA Pr eo pt_Si ng le(mesh, cache_ siz e): • Reor deri ng wi th GPS(Gi bb s, Pol e, S tock ey er) m ethod • If G_B W cons tr ai nt (G_B W*2+1 < cac he_si ze) not sa tis fied: Bandwi dth Li m ited P arti tioni ng (bi secti on)

M emory Bandw idt h Limi ts

•Nearly two orders of magnitude difference between theoretical computational power and the zero cache case •FLOPS* <Real FLOPS< Theoretical Maximum FLOPS

Preoptim iz ed inp ut and algorithm s • Loca lity -Based Placemen t of Inpu t Da ta in Main Mem or y: Input s of a n oper ati on pl ac ed near t o ea ch other i n memory . • Alg orith ms w ith ma ximal da ta reuse , and m ini mal I/O: Rec al cul ati on of da ta can be cheaper than I/O . • Ar chit ectur e depen den t alg orithms: At tri but es of the ar chi tectur e(c ac he siz e, num . of PEs, in put B W) a re in put par ame ter s of th e pr ogr ams. • Ar chit ectur es for alg orith m classes (FPGA ) FPGA - Soluti on To h andl e mem or y bandwi dth li mi ta tions of mesh computi ng w e s ug ges ted an FPGA desi gn whi ch: • Reads node da ta fr om of f-chi p DRAMs by seq uen tial bu rs ts (s tr eam) • Reads and wri tes bac k al l node da ta onl y once. The whole DRAM bandwidth is utili zed to fe eding pr ocessor units . The inde x of mesh no des in off -c hi p memory det ermi nes a G -BW .T he requir ed siz e of on - chip mem or y is (2 G_B W +1 ) No deSi ze fo rt he gi ven ar chi tectur e.

Impr oving da ta l oc ality f or mesh compu ta tions Cellu lar Nonl inear Netw or ks (C NN )

•PEs get input directly (light sensors) –input BW problem solved •No shared cache, only local memory at each PE •Only local connections between PEs

In pu t is an imag e -space and time ar e also in form ation : the addr ess(spac e) of da ta X i n mem or y( input imag e) is de term ines whi ch da ta Y ( nei ghbor s of X) wi ll be i nv ol ved in an oper ati on de fined on X. Data Lo calit y - Stream Comp ut ing

Reordering

• Graph + Node Indexing Function => G_BW
• G_BW is the maximal distance between nonzero elements and the diagonal of the adjacency matrix
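The definition can be made concrete with a short sketch that builds the adjacency matrix under a given node indexing function and measures the largest distance of a nonzero entry from the diagonal. The helper names are hypothetical; the example graph is a 4-cycle chosen only for illustration.

```python
def adjacency_matrix(edges, index):
    """Adjacency matrix of an undirected graph under a node indexing function."""
    n = len(index)
    matrix = [[0] * n for _ in range(n)]
    for a, b in edges:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1
    return matrix

def g_bw_from_matrix(matrix):
    """Maximal |row - column| over the nonzero entries."""
    return max(
        (abs(r - c) for r, row in enumerate(matrix) for c, v in enumerate(row) if v),
        default=0,
    )

# The same 4-cycle a-b-c-d-a under two different indexing functions
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
naive = {"a": 0, "b": 1, "c": 2, "d": 3}
better = {"a": 0, "b": 1, "d": 2, "c": 3}

bw_naive = g_bw_from_matrix(adjacency_matrix(edges, naive))    # 3
bw_better = g_bw_from_matrix(adjacency_matrix(edges, better))  # 2
```

The graph is unchanged; only the indexing function differs, and G_BW drops from 3 to 2.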

(21)

Depth Level Structure (DLS)
Finding Structure in Unstructured Meshes:
• Boundary surfaces are known
• Breadth-first search – BFS from the Boundary Set defines the Depth Level Structure

Results
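The Depth Level Structure described on the DLS slide can be sketched as a multi-source breadth-first search started from the boundary set. This is a minimal illustration, assuming the mesh is given as an adjacency dictionary and the boundary set is non-empty; the function name is hypothetical.

```python
from collections import deque

def depth_level_structure(adj, boundary):
    """Multi-source BFS: level 0 is the boundary set, level d holds the
    nodes whose shortest path to the boundary has d edges."""
    depth = {node: 0 for node in boundary}
    queue = deque(boundary)
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in depth:
                depth[nb] = depth[node] + 1
                queue.append(nb)
    levels = [[] for _ in range(max(depth.values()) + 1)]
    for node, d in depth.items():
        levels[d].append(node)
    return levels

# A path 0-1-2-3-4 whose two end points form the boundary
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
levels = depth_level_structure(adj, [0, 4])
```

Here the deepest level contains only node 2, the node farthest from the boundary.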

Bandwidth Limited Partitioning
• Every part has a node ordering with less G_BW than a given bound (the ordering has to be given).
• Every part has better COM_R than a given bound.
• Every part has nearly the same size.
• Number of parts is maximized (< max_P)

DLS-Based Bisection

Preoptimization – Many FPGAs
Preopt_Many(mesh, cache_size, max_COM_R, max_P):
• Bandwidth Limited Partitioning
• COM_R: Communication Ratio = num. of outgoing edges / internal edges. Communication with other chips / communication with own memory.
• max_COM_R: external memory BW / inner memory BW (0.1 in case of ADM-XRC-6T1 cards)
• max_P: number of available FPGAs. In some cases opt_P < max_P due to communication requirements.

DLS-Based Bisection
• Deepest Levels must be cut, to get parts with good G_BW property
• BFS waves are used to create separating node sets (surfaces)

Conclusions
• Processor architectures are memory bandwidth limited
• Novel mesh partitioning problem presented: Create parts with better locality (G_BW)
• First proof of concept algorithm created: 35% G_BW reduction with bisectioning (50% is max.), 20% better results vs. Metis (Metis has a different goal!)
• Constraint on COM ratio can be satisfied
• Novel partitioning approach: get separators with operators defined on node sets
• Further investigations are needed...
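The communication ratio COM_R used by Preopt_Many can be computed directly from its definition (outgoing edges divided by internal edges). The sketch below is illustrative only: the function names are hypothetical, and treating "better than a given bound" as "less than or equal to max_COM_R" is an assumption.

```python
def com_r(adj, part):
    """COM_R of one part: number of outgoing edges / number of internal edges."""
    part = set(part)
    outgoing = sum(1 for a in part for b in adj[a] if b not in part)
    # every internal edge is seen from both of its end points
    internal = sum(1 for a in part for b in adj[a] if b in part) // 2
    if internal == 0:
        return float("inf")
    return outgoing / internal

def satisfies_com_r(adj, parts, max_com_r):
    """True if every part respects the bound (assumed to mean <= max_com_r)."""
    return all(com_r(adj, p) <= max_com_r for p in parts)

# A path 0-1-2-3-4-5 cut in the middle
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
parts = [[0, 1, 2], [3, 4, 5]]
ratio = com_r(adj, parts[0])   # 1 outgoing edge / 2 internal edges = 0.5
```

On such a tiny mesh the ratio is far above the 0.1 bound quoted for the ADM-XRC-6T1 cards; large parts with small separating surfaces are needed to get below it.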

(22)

Problem statements – safety requirements

• Compact collision avoidance system is not available
• Need to avoid other UAVs and small manned airplanes (Cessna)
• 12.5 s detection for human pilot
• Our goal is 20 s
  – Detection from 2000 m
  – 0.05 degree/pixel
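The numbers in the safety requirements can be related with a short calculation. The sketch assumes a constant closing speed and a simple pinhole-style geometry, neither of which is stated on the slide; the function names are hypothetical.

```python
import math

def pixel_footprint_m(distance_m, deg_per_pixel):
    """Width covered by one pixel at the given distance."""
    return distance_m * math.tan(math.radians(deg_per_pixel))

def closing_speed_m_s(distance_m, warning_time_s):
    """Constant closing speed at which detection at distance_m
    leaves warning_time_s before the encounter."""
    return distance_m / warning_time_s

footprint = pixel_footprint_m(2000.0, 0.05)   # roughly 1.75 m per pixel
speed = closing_speed_m_s(2000.0, 20.0)       # 100 m/s
```

Under these assumptions an intruder seen at 2000 m spans only a few pixels, which is why the later slides work with single candidate points rather than shapes.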

[Figure: encounter geometry with our track, the intruder, the collision volume (CV), the separation minima (SM), the collision avoidance threshold and the traffic avoidance threshold]

Sensor and computational system

• 5 wVGA micro cameras (will be replaced with 1.2 Mpixel ones)
  – Aptina MBSV034 sensor
  – 5 g
  – <150 mW
  – 3.8 mm megapixel objectives (M-12)
  – 70 degrees between two cameras
  – Total view angle: 220˚x78˚
• FPGA board with Spartan 6 FPGA
• Solid State Drive (128 Gbyte)

[Figure: 752x480 cameras connected to the FPGA board, with the Solid State Drive and a link to/from the navigation computer]

Why multiple micro cameras? The view angle of acceptable optics is limited. Regular cameras are bulky. Optics for regular cameras are also bulky. The output of regular cameras is not FPGA friendly.

Processor architecture in FPGA

• Foveal processor hierarchy
• Three image processing accelerators
  – Full frame preprocessor
  – Grayscale and binary ROI processors
• Parallel operation
• Optimized for the image processing algorithm
• Separate processor for Kalman filters

[Figure: block diagram with memory controller, DRAM, Microblaze processor, image capture, full frame preprocessing, gray scale processor and binary processor]

Outline

• Problem statements
• Hardware setup
  – Cameras
  – Electronics
  – Mechanics
• Software and algorithms
  – Algorithm overview
  – Preprocessing
  – Post processing
  – Global and local adaptivity
• Conclusion

View angle

Foveal See -and -avoid System for Small UAV

Á. Zarándy, T. Zsedrovits, Z. Nagy, A. Kiss, T. Roska

Framework

Planned on-board sensor and control loop

(23)

Algorithmic components

• Foveal processing:
  – Preprocessing on the full frame
    • Identifying candidate points
  – Post processing
    • Discarding non-relevant candidate points
    • Tracking
• Multi-level global and local adaptivity

Post processing (ROIs)

• Discard edges of clouds
• Significantly reduces the number of candidate points
• Resulting few targets can be tracked

[Flowchart: cutting the perimeter of each object → histogram calculation → number of humps? → 1: accept candidate point, 2: reject candidate point]
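One possible reading of the "number of humps" test is sketched below: gray levels sampled along the perimeter cut around a candidate are put into a histogram, and separated groups of occupied bins are counted. A uniform sky background gives one hump, while a cloud edge gives two. The 8-bit gray range, the bin count, the `min_count` parameter and the function names are all assumptions made for illustration, not details taken from the slides.

```python
def count_humps(values, bins=8, max_value=256, min_count=1):
    """Count runs of adjacent occupied histogram bins (a crude hump count)."""
    width = max_value / bins
    hist = [0] * bins
    for v in values:
        hist[min(int(v / width), bins - 1)] += 1
    humps, inside = 0, False
    for count in hist:
        occupied = count >= min_count
        if occupied and not inside:
            humps += 1
        inside = occupied
    return humps

def accept_candidate(perimeter_values):
    """Accept when the perimeter is uniform (one hump); reject otherwise."""
    return count_humps(perimeter_values) == 1

sky_only = [200, 205, 210, 198]              # bright, uniform background
cloud_edge = [60, 62, 65, 200, 205, 210]     # dark and bright groups
```

A production version would need smoothing and a noise-tolerant peak detector; the run-counting here only shows the accept/reject logic.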

Hardware overview

• Sensing and processing system
  – Field of view: 220°x78°
  – Resolution: ~2250x720
  – Frame-rate: 56 FPS
  – Processor: Spartan6 L45T FPGA
  – Storage: 128 Gbyte (23 min)
  – Size: 125x145x45 mm (5”x6”x1.8”)
  – Weight: ~450 g (1 lb)
  – Power consumption: <8 W
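The figures in the hardware overview can be cross-checked with a little arithmetic. The sketch assumes 8-bit raw pixels (one byte per pixel), which is not stated on the slide; the resolution, frame rate, field of view and recording time are taken from the list above.

```python
WIDTH, HEIGHT, FPS = 2250, 720, 56     # from the hardware overview
BYTES_PER_PIXEL = 1                    # assumption: 8-bit raw pixels
RECORD_SECONDS = 23 * 60               # 23 minutes of storage

bytes_per_second = WIDTH * HEIGHT * FPS * BYTES_PER_PIXEL
total_gbytes = bytes_per_second * RECORD_SECONDS / 1e9

deg_per_pixel_h = 220 / WIDTH          # horizontal angular resolution
deg_per_pixel_v = 78 / HEIGHT          # vertical angular resolution

print(f"{bytes_per_second / 1e6:.1f} MB/s, {total_gbytes:.1f} GB in 23 min")
print(f"{deg_per_pixel_h:.3f} x {deg_per_pixel_v:.3f} degree/pixel")
```

Under the one-byte-per-pixel assumption the stream is about 90.7 MB/s and fills roughly 125 GB in 23 minutes, which is consistent with the quoted 128 Gbyte drive.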

Preprocessing (full frame)

• Identifies the candidate aerial objects
• Finds numerous false targets also
• Local adaptation based on edge density
• Global adaptation based on number of candidate points

[Flowchart: contrast calculation → locally adaptive contrast thresholding → candidate points → regions of interest (ROIs); feedback: too many or too few points? (y/n) → threshold adjustment]
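The two adaptation levels of the preprocessing step can be sketched as follows. This is a hedged illustration, not the implemented algorithm: the way edge density scales the local threshold, the multiplicative `step`, and all function and parameter names are assumptions. On the real system the global feedback presumably acts from frame to frame; here it simply iterates on one frame.

```python
def candidate_points(contrast, edge_density, base_threshold):
    """Locally adaptive thresholding: the threshold rises where the
    local edge density is high (for example inside textured clouds)."""
    points = []
    for r, row in enumerate(contrast):
        for c, value in enumerate(row):
            local = base_threshold * (1.0 + edge_density[r][c])
            if value > local:
                points.append((r, c))
    return points

def adapt_threshold(contrast, edge_density, threshold,
                    min_points, max_points, step=1.25, max_iter=20):
    """Global adaptation: raise the threshold when there are too many
    candidate points, lower it when there are too few."""
    points = candidate_points(contrast, edge_density, threshold)
    for _ in range(max_iter):
        if len(points) > max_points:
            threshold *= step
        elif len(points) < min_points:
            threshold /= step
        else:
            break
        points = candidate_points(contrast, edge_density, threshold)
    return threshold, points

contrast = [[10, 50], [30, 90]]
flat = [[0.0, 0.0], [0.0, 0.0]]
threshold, points = adapt_threshold(contrast, flat, 5.0, min_points=1, max_points=2)
```

Starting from a threshold that lets all four pixels through, the loop raises it until only the two strongest contrast values remain.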

Mechanics

• Stable camera holder
  – Alignment
  – Avoids cross vibration of the cameras
  – 100 g
  – Aluminum alloy
  – Electronics in the middle
  – Covered with aluminum plates

(24)

Pre- and post processing + tracking

Red points: all candidate objects
Green points: allowed by post processing
Blue points: tracked objects

Thank you for your attention! Any Questions?

11.12.2012, IRUN Workshop, 5-7 November 2012, Budapest

Outline
• The Project and Ph.Ds
• Mathematical model of CNN
• CT-CNN implementations, advantages and drawbacks
• DT-CNN hardware implementation methods
• The proposed structure and its architecture
  – CNN emulator block
  – Basic processing unit
  – Reconfigurable and programmable features
• Implementation results

Pre- and post processing

Red points: all candidate objects
Green points: allowed by post processing

Summary

• A flyable sensor-processor data acquisition system was built
• An algorithm for identifying mid-air intruders was developed
• Software porting and validation are needed

The Aim
To present the architecture and mathematical basis of CNN based image processing hardware implemented on FPGA devices, which is developed within the framework of a research project.

Example 1: Ground camera in hand
Example: Airborne camera

The Mathematics and FPGA Implementation of a Class of Real-Time CNN Processors
Vedat Tavsanoglu

(25)

The Project
Design and Implementation of a Cellular Neural Network Structure for Still and Video Image Processing Using a New FPGA Architecture, supported by TÜBİTAK (The Scientific and Technological Research Council of Turkey, Project No: 108E023)

Publications

• Kayaer, K., and Tavsanoglu, V.: "A New Approach to Emulate CNN on FPGAs for Real Time Video Processing", CNNA'08
• Yildiz, N., Cesur, E., and Tavsanoglu, V.: "A New Control Structure For The Pipelined CNN Processor Arrays", CNNA'2010
• Cesur, E., Yildiz, N., and Tavsanoglu, V.: "Architecture of the Next Generation Real Time CNN Processor: RTCNNP-v2", NOLTA'2010
• Yildiz, N., Cesur, E., and Tavsanoglu, V.: "Demonstration of the Second Generation Real-Time Cellular Neural Network Processor: RTCNNP-v2", CNNA'2012
• Cesur, E., Yildiz, N., and Tavsanoglu, V.: "On an Improved FPGA Implementation of CNN-Based Gabor-Type Filters", TCAS-II

Mathematical Model of CT CNN
The cell state equation

$$\dot{x}_{ij}(t) = -x_{ij}(t) + \sum_{k=-m}^{m} \sum_{l=-m}^{m} \left( a_{kl}\, y_{i+k,j+l}(t) + b_{kl}\, u_{i+k,j+l} \right) + z$$

can be written as

$$\dot{x}_{ij}(t) = -x_{ij}(t) + A \circledast Y_{ij}(t) + B \circledast U_{ij} + z$$

where ⊛ is the template dot product operator.

Outline
• Implementation of a Real-Time Two-Layer CNN Emulator
  – Cell-state equation of an m-neighborhood space-invariant two-layer DT-CNN
  – Processing units used for the computation of the output of the first and second layers
  – The two-layer APU
  – The processor array of the two-layer RTCNNP
• Implementation of GTF using the RTCNNP

Ph.D. Theses
• Evren Cesur: Design and Implementation of a New FPGA Structure to Realize Gabor Filters, 2012. Covers the modification of the processors to implement GTF cell-state equations with complex coefficients.
• Murathan Alpay: Design of a Multi-Layer CNN Emulator and its Implementation on an FPGA Device, 2012. A topology for the connection of processors is proposed for the implementation of a multi-layer CNN emulator that is also capable of carrying out any complex valued computation.

Mathematical Model of CT CNN
An m-neighborhood space-invariant continuous-time CNN is completely described by the cell state equation

$$\dot{x}_{ij}(t) = -x_{ij}(t) + \sum_{k=-m}^{m} \sum_{l=-m}^{m} \left( a_{kl}\, y_{i+k,j+l}(t) + b_{kl}\, u_{i+k,j+l} \right) + z$$

and the output equation

$$y_{ij}(t) = f\left(x_{ij}(t)\right) = \frac{1}{2}\left( \left|x_{ij}(t) + 1\right| - \left|x_{ij}(t) - 1\right| \right)$$
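The cell state and output equations can be exercised numerically with a small sketch. It advances the state by one forward Euler step, which is an assumed discretization chosen for illustration and not necessarily the scheme used in the FPGA emulator; cells outside the array are treated as zero, which is also an assumption. Function and parameter names are hypothetical.

```python
def output(x):
    """Output equation: y = 0.5 * (|x + 1| - |x - 1|)."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))

def template_sum(template, field, i, j):
    """Sum over k, l in [-m, m] of template[k][l] * field[i+k][j+l].
    Cells outside the array contribute zero (assumed boundary condition)."""
    m = len(template) // 2
    total = 0.0
    for k in range(-m, m + 1):
        for l in range(-m, m + 1):
            r, c = i + k, j + l
            if 0 <= r < len(field) and 0 <= c < len(field[0]):
                total += template[k + m][l + m] * field[r][c]
    return total

def euler_step(x, u, a, b, z, dt):
    """One forward Euler step of the cell state equation."""
    y = [[output(v) for v in row] for row in x]
    return [
        [
            x[i][j] + dt * (-x[i][j] + template_sum(a, y, i, j)
                            + template_sum(b, u, i, j) + z)
            for j in range(len(x[0]))
        ]
        for i in range(len(x))
    ]

# With zero templates and zero bias the state simply decays toward zero
zero = [[0.0] * 3 for _ in range(3)]
x0 = [[1.0, 1.0], [1.0, 1.0]]
u0 = [[0.0, 0.0], [0.0, 0.0]]
x1 = euler_step(x0, u0, zero, zero, 0.0, 0.1)
```

The templates `a` and `b` are (2m+1) x (2m+1) lists, so the 3 x 3 case above corresponds to m = 1.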

Outline
• Mathematical overview of Continuous- and Discrete-Time 2-D Gabor-Type Filters
• Implementation of Real-Time 2-D Gabor-Type Filters

Ph.D. Theses
• Kamer Kayaer: Design and Implementation of a New CNN Emulator on FPGA for Real Time Video Processing, 2008. First-generation RTCNN multi-processor.
• Nerhun Yildiz: Design of a CNN Emulator and its Implementation on an FPGA Device, 2012. Second-generation RTCNN multi-processor, which is highly configurable and programmable.

Publications

• Cesur, E., Yildiz, N., and Tavsanoglu, V.: "An improved FPGA implementation of CNN Gabor-type filters", ISCAS'11
• Cesur, E., Yildiz, N., and Tavsanoglu, V.: "Demo: An improved FPGA implementation of CNN Gabor-type Filters", CNNA'12
• Yildiz, N., Cesur, E., Tavsanoglu, V., and Kayaer, K.: "Architecture of a Fully-Pipelined Real-time Cellular Neural Network Emulator", unpublished
• Alpay, M., Yildiz, N., and Tavsanoglu, V.: "A New Real-Time Multi-Layer Cellular Neural Network Implementation", unpublished

Contents

Prof. Tamás Roska, Physical and Virtual Cellular Machines for Nanoscale Architectures, p. 8
Prof. Ronald Tetzlaff, High-speed optical control systems in industrial manufacturing processes based on Cellular Neural Networks, p. 12
Csaba Nemes, Automatic generation of locally controlled arithmetic unit via floorplan based partitioning, p. 18
Antal Hiba, Improving data locality for mesh computations, p. 20
Tamás Zsedrovits, Foveal See-and-avoid System for Small UAV, p. 22
Prof. Vedat Tavsanoglu, The Mathematics and FPGA Implementation of a Class of Real-Time CNN Processors, p. 24
András Kiss, Precision of Arithmetic Units for PDE solvers on FPGA, p. 34
Zoltán Nagy, Investigation of FPGA Based Acceleration of Computational Fluid Flow Simulation on Unstructured Mesh Geometry, p. 37
Vamsi Adhikarla, View Synthesis for Lightfield Displays using Region-Based Non-Linear Image Warping, p. 39
László Füredi, Hardware Acceleration of 3D TLM Method with FPGA, p. 41
Dr. Müstak E. Yalcin, Randomly Reconfigurable Processor Population, p. 43
Prof. Péter Arató, High-Level Synthesis of a Task-Oriented Multiprocessing Structure, p. 49
Prof. Ákos Zarándy, Retina based approaching object detection implementation on kilo-processor device, p. 52
János Rudan, Parallel Methods Applied to the Computation of Dynamically Equivalent Structures of Biochemical Reaction Networks, p. 56
Ádám Rák, RACER: Joint instruction-data-stream-based kiloprocessor architecture and programming procedure, p. 57
Dr. Xavier Vilasis-Cardona, Multicore trends in high energy physics applications, p. 59
Prof. Ferenc Friedler, Effective Mathematical Modeling for Process Systems Engineering, p. 65
Prof. Alexandru Gacsádi, Variational Computing Based Image Processing Methods by using Cellular Neural Networks, p. 74
Prof. Péter Szolgay, Toward Mega-Core Processor Architectures, p. 78

Spiking Neural Network Architecture
• 130 nm CMOS Technology
• 18 cores per chip, 48 chips per board
• 57,600 custom chips
• 50-100 kW (1/100 of the general purpose supercomputer, 100 times bigger than an analog CMOS)
• IEEE Spectrum, August 2012, pp. 40-43

Physical and Virtual Cellular Machines for Nanoscale Architectures
Tamás ROSKA
Pázmány University, the University of Notre Dame, and the Hungarian Academy of Sciences, Budapest, Hungary
Budapest, November 5-7, 2012

Theory and Practice of Kiloprocessor Architectures, IRUN Workshop, 2012 … why?

[Figure labels: VCMa: Navier-Stokes PDE; VCM1; VCM2]

Proactive Adaptive Cellular Engine (PACE)

The Design Scenario
[Diagram labels: Mapping Algorithms; Equivalent Transformations; Operation on Processor / memory topography; Algorithms on Data / object topography; Task/Problem/Workload; Physical Implementation]

O-CNN – oscillatory CNN
[Figure panels: General cellular model; Simple 1D cellular model; O-CNN AM clusters; Signature]

Ronald Tetzlaff