
6 Visual speech in audio transmitting telepresence applications

Supporting an ATVS system with head models requires information on the representation of the ATVS system. In earlier results we used PCA parameters, and the face rendering parameters were based on measurements. If the system has to use head models made by graphical designers, the main component states of the head should be clearly formulated. Graphical designers cannot draw principal components, since these abstractions cannot be observed in themselves in nature; designers can, however, draw viseme states.

For audio preprocessing we used MFCC. Some applications include other audio preprocessing; in the case of audio transmission it is mostly Speex [47].

I will describe a method for easily enhancing audio-transmitting telepresence applications by reusing their internal Speex preprocessor, producing output that is capable of rendering visual speech from viseme states.

6.1 Method

In on-line cyberspaces there are artificial bodies which imitate realistic behavior controlled by remote users. An important aspect is the realistic facial motion of human-like characters according to the actual speech sounds. This chapter describes a memory- and CPU-efficient method of visual speech synthesis for on-line applications using a voice connection over the network. The method is real-time and can be activated on the receiver client without server support: only the coded speech signal needs to be sent, and the visual speech synthesis is the task of the receiving client. The animation rendering is supported by graphical accelerator devices, so the CPU load of the conversion is insignificant.



6.1.1 Introduction

Voice-driven visual speech synthesis has a growing popularity in cyber-telepresence applications. As of 2009 there are more and more video games on the market with the benefits of this technology.

The most popular use of visual speech synthesis is real-time rendering of pre-calculated facial animation. This meets all the requirements in an artificial world where the content of the voices is given by the designers, it is recorded with voice actors, and there is time to do all the calculations during production. Examples of this technology are the titles Oblivion and Fallout 3 from Bethesda Softworks [48], which use the MPEG-4 based FaceGen [13]. Although this approach is extendable to real-time applications by concatenative synthesis, we will see that it is not a really suitable solution.

In a real-time telepresence application the player activates the transmission; the client side records the voice in small chunks and sends it to the server, which forwards it to the given subset of the players: teammates or any characters nearby. During active voice transmission the visual feedback on the receiver client side is some visual effect on the character, like an icon, a light effect, or a basic or random facial motion. An example of basic facial motion is in Counter-Strike: Source [49], where the momentary voice energy is visualized by the movement of the jaw. We will use this approach as a baseline. Our solution is a replacement of this with improved quality, allowing even lip reading.

6.1.2 Overview

Real-time or pre-calculated motion control

In the case of production-time methods all of the audio content is available in advance. A typical example starts from a screenplay, and the voice recordings are based on the given text. There are solutions to extract a phoneme string from text and to synchronize this phoneme string to the recordings, like Magpie [50] for example. Voice-synchronized phoneme strings can be used to create a viseme string with visual co-articulation. The viseme is the basic unit of visual speech (Fig. 6.2), practically the visual consequence of the pronunciation of a phoneme. The viseme string with timing carries the visual information, and co-articulation methods have to form it into a natural visual flow. Viseme combinations were mapped for interactions such as domination or modification, and with this knowledge viseme pairs or longer subsequences are used for the synthesis.

Also, during production the speech signal is available as a whole sentence. This makes those methods usable which, for a given frame, use data from the voice of the following frames. This information is definitely important for precise facial motion [38, 6]. Real-time methods are not allowed to use long buffers because of the disturbing delay.

One of the real-time approaches uses an automatic speech recognition (ASR) system to extract the phoneme string from the voice [6]. The benefit of this approach is its compatibility with viseme string concatenator methods: ASR simply replaces the manually extracted, annotated phoneme string information.


Figure 6.1: Our real-time method: the visual speech animation parameters are calculated from the Speex coding parameters with a neural network.

The ASR system can be trained on usual speech databases without visual data. The drawback is the time and space complexity of the recognition and the propagation of the recognition errors caused by falsely categorized phonemes or words.

The other way is direct conversion, which is simpler and faster but usually less accurate because of the lack of language-dependent information.

6.1.3 Face model

For a speaking head model there are two requirements: the artists should be able to design the model easily, and it should have enough degrees of freedom. For example, in the game Counter-Strike: Source the mouth motion has one degree of freedom, the position of the jaw, and it is directly linked to the energy of the signal. Although this behaves in an obviously artificial way, it is numerically a fair approximation, since PCA (Principal Component Analysis) factorization of visual speech shows that 90% of the deviation is in the first principal component, which mainly captures the motion of the jaw [8]. In order to have a more sophisticated head model there should be more degrees of freedom, including the horizontal motion of the mouth boundary or more.

In our works we used PCA-based facial coordinates to represent a facial state. This representation has some nice properties, such as the mathematically proven maximum compression rate along linear bases for a given dimension count. Each state is expressed in an optimal basis calculated from a visual speech database. In this way 30-dimensional facial data can be compressed into 6 dimensions with only 2% error. When we visualize the calculated basis, the coordinates show motion components such as jaw motion, lip rounding, and so on. These are not visemes, as visemes are not guaranteed to be orthogonal to each other. We used guided PCA in this case, including the most important visemes as long as possible.
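To make this concrete, here is a minimal sketch of the compression step, assuming the mean face and an orthonormal basis have been precomputed from the visual speech database; the dimensions follow the 30-to-6 figures above, and all names and the data layout are illustrative.

```cpp
#include <array>

constexpr int kFullDim = 30;  // marker coordinates per facial state
constexpr int kPcaDim  = 6;   // retained principal components

using FullState = std::array<double, kFullDim>;
using PcaCoords = std::array<double, kPcaDim>;

// Project a facial state onto the PCA basis (rows of `basis` are
// orthonormal basis vectors, `mean` is the mean face, both precomputed
// from the visual speech database).
PcaCoords compress(const FullState& x, const FullState& mean,
                   const double basis[kPcaDim][kFullDim]) {
    PcaCoords c{};
    for (int i = 0; i < kPcaDim; ++i)
        for (int j = 0; j < kFullDim; ++j)
            c[i] += basis[i][j] * (x[j] - mean[j]);
    return c;
}

// Reconstruct the approximate facial state from the 6 coordinates;
// per the measurements above, this loses only about 2% on average.
FullState reconstruct(const PcaCoords& c, const FullState& mean,
                      const double basis[kPcaDim][kFullDim]) {
    FullState x = mean;
    for (int i = 0; i < kPcaDim; ++i)
        for (int j = 0; j < kFullDim; ++j)
            x[j] += c[i] * basis[i][j];
    return x;
}
```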

For a designer artist it is easier to build multiple model shapes in different phases than to build one model capable of parameter-dependent motion by implementing rules in the 3D framework's script language. The multiple shapes should be the clear states of typical mouth phases, usually the visemes, since these phases are easy to capture by example. A designer would hardly create a model in a theoretical state given by factorization methods.


Therefore we need facial animation control based on face states, so that the designer can work with examples. The control of the facial animation can be the weights of the drawn shapes. Generally it is not true that every facial state can be expressed from any set of visemes, but there is an approximation of the states for a given viseme set, and depending on the size of this set, and so on the degrees of freedom, any level of accuracy can be reached (Table 6.1). This approach may use more degrees of freedom than the PCA-based approach for the same quality, since PCA is optimal in this respect.

The rendering of the face is efficient, as the graphical interfaces usually provide hardware-accelerated vertex blending. There are more sophisticated approaches using volume-conserving transformations [51] with a slight increase in time complexity; these methods can be used to render our approach as well. Support for features like wrinkling skin is outside our interest.

6.1.4 Viseme based decomposition

The video data is from a video recording of a talking person with a fixed field of view. The head of the person was fixed to the chair to eliminate the motion of the whole head. The face of the person was prepared with marker points which were tracked automatically and corrected manually [8]. The position of the nose was used as the origin, so every frame was translated to a common frame. The automatic tracking was based on color-sensitive highlight tracking with automatic quality feedback to help the manual corrections. There were 15 markers, placed on a subset of the MPEG-4 feature points. The marker tracker results in a vector stream in 2D pixel space. This representation is good for measurement, because no special equipment is needed for the recording, and it is also relatively good for estimating quality, since the generated animation will give the same data in the best scenario. To achieve interchangeable metrics, the pixel unit must be eliminated. This elimination is done by using the distribution of the given marker position as a reference for the position error. In this case the pixel units are eliminated from the result by transforming to a relative scale.

\[
E(G) = \frac{1}{N f} \sum_{i=1}^{N} \sum_{j=1}^{f} \frac{\left| G_i^j - S_i^j \right|}{\sigma(S_i)} \tag{6.1}
\]

where N is the dimensionality of the visual representation, f is the total number of frames, G is the generated signal compared against the signal S, and E is the estimated error value. S can be any linear representation of the facial state, given in pixels, vertices, or MPEG-4 FAP values.
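A direct transcription of Equation 6.1 could look like the sketch below. It assumes frame-major storage for G and S and a precomputed per-dimension standard deviation of the reference signal; the absolute difference follows the "average distance" reading of the error, and all names are illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Transcription of Equation 6.1. G and S are frame-major:
// G[j][i] is dimension i of frame j. sigma[i] is the standard
// deviation of dimension i of the reference signal S.
double estimatedError(const std::vector<std::vector<double>>& G,
                      const std::vector<std::vector<double>>& S,
                      const std::vector<double>& sigma) {
    const std::size_t f = G.size();      // total number of frames
    const std::size_t N = sigma.size();  // dimensionality of the representation
    double sum = 0.0;
    for (std::size_t j = 0; j < f; ++j)
        for (std::size_t i = 0; i < N; ++i)
            sum += std::fabs(G[j][i] - S[j][i]) / sigma[i];
    return sum / (double(N) * double(f));
}
```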

Every facial state is expressed as a weighted sum of the selected viseme state sets. The decomposition algorithm is a simple optimization of the weight vectors of the viseme elements, resulting in minimal errors. The visemes are given in pixel space. Every frame of the video is processed independently in the optimization. We used a partial gradient method with a convexity constraint to optimize the weights, where the gradient was based on the distance between the original and the weighted viseme sum (Equation 6.1).


Table 6.1: Errors as a function of the number of visemes used in composition, on important feature points. Visemes are written in the corresponding phoneme codes, except "closed" and S*, which is not a standalone dominant viseme; it is used at the beginning of the word.

6 visemes (closed, 2, I, O, E, S*): 3.48% error

The constraint is a sufficient but not necessary condition to avoid unnatural results such as a too big mouth or head: no negative weights are allowed, and the sum of the weights is one. In this case a step in the partial gradient direction means a larger change in that direction and a small change in the remaining directions to balance the sum. The approximation is accelerated and smoothed by choosing the starting weight vector from the last result.
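A minimal sketch of one frame of this constrained optimization is given below, assuming a squared pixel-space distance. The step size, iteration count, and the clamp-and-renormalize projection are illustrative simplifications, not the exact procedure of the thesis; the warm start from the previous frame follows the description above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One frame of the viseme decomposition: minimize the distance between
// `target` and the weighted viseme sum, keeping the weights non-negative
// and summing to one. Warm-start `w` with the previous frame's result.
std::vector<double> decomposeFrame(
        const std::vector<std::vector<double>>& visemes, // [viseme][dim]
        const std::vector<double>& target,               // [dim]
        std::vector<double> w,                           // warm start
        int iters = 100, double step = 0.01) {
    const std::size_t k = visemes.size(), dim = target.size();
    for (int it = 0; it < iters; ++it) {
        // Residual of the current convex combination.
        std::vector<double> r(target);
        for (std::size_t v = 0; v < k; ++v)
            for (std::size_t d = 0; d < dim; ++d)
                r[d] -= w[v] * visemes[v][d];
        // Gradient step on each weight, clamped to be non-negative.
        for (std::size_t v = 0; v < k; ++v) {
            double g = 0.0;
            for (std::size_t d = 0; d < dim; ++d)
                g += r[d] * visemes[v][d];
            w[v] = std::max(0.0, w[v] + step * g);
        }
        // Re-normalize so the weights sum to one (convexity constraint).
        double s = 0.0;
        for (double x : w) s += x;
        if (s > 0.0) for (double& x : w) x /= s;
    }
    return w;
}
```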

The state G can be expressed as a convex sum of the viseme states V, which can be any linear representation, such as pixel coordinates or 3D vertex coordinates:

\[
G = \sum_{k} w_k V_k \tag{6.2}
\]

\[
w_k \ge 0, \qquad \sum_{k} w_k = 1 \tag{6.3}
\]

The convexity guarantees that the blending is independent of the coordinate system. If the designer uses unnormalized vertex coordinates, a weighted sum whose weights sum to more or less than one can result in translation and magnification of the head.

The results of this simple approximation are acceptable. The quality is estimated by the pixel errors of the important facial feature points. The selection of important points is based on deviation: those feature points whose deviation is above the average are chosen.

The head model used in the subjective tests is three-dimensional, while this calculation is based on two-dimensional similarities, so the phase decomposition rests on the assumption that two-dimensional (frontal view) similarity induces three-dimensional similarity. This assumption is numerically reasonable with projection.

Note that the representation quality is scalable by setting the viseme count. This makes the resulting method scalable on the client side.


Figure 6.2: Visemes are the basic units of visual speech. These are the visemes we used for the subjective opinion score tests, in this (row-major) order.


6.1.5 Voice representation

Every voice processing method needs to extract useful information from the signal. Algorithms which directly use the sound pressure signal are called time domain methods; those which use the Fourier transform or other frequency-related filter banks are called frequency domain methods; and those which use some (lossy) compressed input are called compressed domain methods.

The applications where voice-driven facial animation matters use voice transmission, and voice transmission systems use lossy compression methods to minimize the network load. Therefore an efficient voice-driven visual speech synthesizer should be a compressed domain method.

A speech coder aims to achieve the best voice quality with reasonably sized data packets. This can be treated as a feature extraction method. The question is: what distance function can be used on the given representation? Is there an appropriate metric that a learning system can approximate?

6.1.6 Speex coding

One of the most popular speech coders for this purpose is Speex [47]. Speex uses LSP, a member of the linear prediction coding family. Linear prediction uses a vector of scalars which predicts the next sample from the previous samples by linear combination:

\[
x_n = \sum_{i=1}^{N} a_i \, x_{n-i} \tag{6.4}
\]

where N is the size of the prediction vector. The optimal predictor coefficient vector for a given x can be calculated by the Levinson-Durbin algorithm. This is a short representation, but it is not suitable for quantization or for linear operations such as linear interpolation; consequently it is not directly used for voice transmission or facial animation conversion. Hence Speex uses LSP, which is a special representation of the same information but amenable to linear operations; for example, the LSP values are linearly interpolated between the compressed frames of Speex.
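For reference, the Levinson-Durbin recursion mentioned above can be sketched in its textbook form as follows; it computes the predictor coefficients from the autocorrelation of the windowed signal. Variable names are conventional and not taken from the Speex sources.

```cpp
#include <vector>

// Levinson-Durbin recursion: solves for the linear prediction
// coefficients a[1..N] given the autocorrelation sequence r[0..N]
// of the windowed signal, so that x_n ~ sum_i a[i] * x_{n-i}.
std::vector<double> levinsonDurbin(const std::vector<double>& r, int N) {
    std::vector<double> a(N + 1, 0.0), prev(N + 1, 0.0);
    double err = r[0];
    for (int i = 1; i <= N; ++i) {
        double k = r[i];                      // reflection coefficient
        for (int j = 1; j < i; ++j) k -= a[j] * r[i - j];
        k /= err;
        prev = a;
        a[i] = k;
        for (int j = 1; j < i; ++j) a[j] = prev[j] - k * prev[i - j];
        err *= (1.0 - k * k);                 // prediction error update
    }
    return a;                                 // a[0] is unused
}
```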

For LSP coding, instead of storing the predictor vector a, we treat it as a polynomial and store its roots. The roots are guaranteed to be inside the unit circle of the complex plane. Finding the roots would require a two-dimensional search, so to avoid this we use a pair of polynomials which are guaranteed to have all their roots on the unit circle, and whose mean is the original polynomial, so a one-dimensional search is enough.

\[
P(z) = a(z) + z^{-(N+1)}\, a(z^{-1}), \qquad Q(z) = a(z) - z^{-(N+1)}\, a(z^{-1}) \tag{6.5}
\]

\[
\mathrm{LSP} = \left\{ z \in \mathbb{C} : P(z)\, Q(z) = 0 \right\} \tag{6.6}
\]


This makes LSP more robust to quantization and interpolation than the predictor vector. Interestingly, the P and Q polynomials describe the vocal tract with the glottis open and closed, which connects them to the topic of audiovisual speech synthesis.

Lossy compression methods quantize the values of a carefully chosen representation. LSP is a compact and robust representation, and Speex uses vector quantization to compress these values. We modified the Speex decoding process to export the uncompressed LSP values and the energy. This adds only 11 assignments and multiplications for scaling as extra computational cost.

6.1.7 Neural network training

The data is from an audiovisual recording of a professional lip-speaker. The recording contains 4250 frames. The content is intended for testing direct voice to visual speech conversion for deaf people; it contains numbers, months, etc. The language is Hungarian. The network is a simple feedforward error-backpropagation network with one hidden layer.

Audio

The audio recording is originally 48 kHz, and it is downsampled to 8 kHz for Speex. We used the modified Speex decoder to extract LSP and gain values as the input for training the neural networks. There are values for each 20 ms window. LSP values lie in [0, π], while the neural network uses the [−1, 1] interval, so scaling was applied.
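Written out, the scaling is the linear map from [0, π] onto [−1, 1]:

\[
\hat{\ell}_i = \frac{2\,\ell_i}{\pi} - 1, \qquad \ell_i \in [0, \pi] \;\Rightarrow\; \hat{\ell}_i \in [-1, 1]
\]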

Video

The target of the neural network is the viseme weight vector representing the facial state. As the original recording is 25 frames per second and the audio data from Speex uses 20 ms windows, the video data was interpolated from 40 ms to 20 ms frame intervals. We used linear interpolation, as it does not violate convexity. The decomposition weight values are in the range [0, 1], which lies within the neural network's [−1, 1] interval, so no scaling was applied.
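Explicitly, each inserted 20 ms frame is a convex combination of the two neighboring 40 ms frames, so the interpolated weight vectors stay non-negative and keep a unit sum:

\[
w(t) = (1 - \alpha)\, w_k + \alpha\, w_{k+1}, \qquad \alpha = \frac{t - t_k}{t_{k+1} - t_k} \in [0, 1]
\]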

Neural network usage

The resulting network is intended to be used directly in the host application. The trained network weights can be exported as a static function of a programming language, for example C++. This source code can be compiled into the client. The function is called with the values exported from the modified Speex codec, and the returned values are applied directly to the renderer. With this approach the runtime overhead is minimal: no file reading or data structures are needed. The generated source code can be created at speech research laboratories, and the application developers just use the code.
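The sketch below shows what such a generated function could look like. The layer sizes, names, and the tanh activation are illustrative assumptions; the weight arrays, zero-filled here only so the sketch compiles, would be emitted by the training tool.

```cpp
#include <cmath>

// Illustrative sizes: 10 LSP values + gain in, a few viseme weights out.
constexpr int kIn = 11, kHidden = 16, kOut = 6;

// Emitted as constants by the trainer; zero-filled placeholders here.
const float kW1[kHidden][kIn] = {}, kB1[kHidden] = {};
const float kW2[kOut][kHidden] = {}, kB2[kOut] = {};

// in:  scaled LSP and gain values from the modified Speex decoder
// out: viseme weights, applied directly to the renderer
void speexToVisemes(const float in[kIn], float out[kOut]) {
    float h[kHidden];
    for (int i = 0; i < kHidden; ++i) {
        float s = kB1[i];
        for (int j = 0; j < kIn; ++j) s += kW1[i][j] * in[j];
        h[i] = std::tanh(s);  // hidden layer activation
    }
    for (int i = 0; i < kOut; ++i) {
        float s = kB2[i];
        for (int j = 0; j < kHidden; ++j) s += kW2[i][j] * h[j];
        out[i] = s;  // weight of the i-th viseme shape
    }
}
```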


6.1.8 Implementation issues

The method can be implemented as a client-side feature, on the receiver side. The user may turn the method on and off, since the calculations are performed on the receiving client's CPU. There is no extra payload on the network traffic.

The CPU cost of the calculation is 200-400 multiplications, depending on the hidden layer size and the degrees of freedom. The space cost of the feature is the multiple shapes of the head models, which depends on how sophisticated the head models of the given application are. The space cost is scalable by setting the viseme set: the more head model shapes, the better the approximation of the real mouth motion.

The head models can be stored in the video accelerator device's memory and can be manipulated through graphical interfaces such as OpenGL or Direct3D. The vertex blending (weighted vector sum) can be calculated on the accelerator device; it is highly parallel since the vertices are independent.
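As a CPU reference for this blending step, the sketch below assumes all shapes share one vertex ordering; all names are illustrative. On the GPU the same per-vertex weighted sum would run in a vertex shader.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// CPU reference of the vertex blending: each output vertex is the
// weighted sum of the corresponding vertices of the viseme shapes.
// Convex weights keep the head in place and in scale.
std::vector<Vec3> blendShapes(const std::vector<std::vector<Vec3>>& shapes,
                              const std::vector<float>& weights) {
    const std::size_t nVerts = shapes.front().size();
    std::vector<Vec3> out(nVerts, Vec3{0, 0, 0});
    for (std::size_t s = 0; s < shapes.size(); ++s)
        for (std::size_t v = 0; v < nVerts; ++v) {
            out[v].x += weights[s] * shapes[s][v].x;
            out[v].y += weights[s] * shapes[s][v].y;
            out[v].z += weights[s] * shapes[s][v].z;
        }
    return out;
}
```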

6.1.9 Results

The training and testing sets were separated, and during the first 1,000,000 epochs (training cycles) the error on the testing set still decreased (Fig. 6.3). Depending on the degrees of freedom, the results are 1-1.5% average error. Our former measurements gave sufficient intelligibility results at this level of numeric error. This shows that a usable training error level can be reached before overtraining, even with relatively small databases.

The details of the trained system's response can be seen in Fig. 6.4. The main motion flow is reproduced, and there are small glitches at bilabial nasals (lips do not close fully) and plosives (visible burst frame). Most of these glitches could be avoided using a longer buffer, but that would cause delay in the response.

A subjective opinion score test was done to evaluate the voice-based facial animation with short videos. Half of the test material was a face picture controlled by the decomposed data, and the other half by facial animation control parameters computed by the neural network from the original speech sounds. The opinion score test included control parameters of 1 to 5 degrees of freedom. Each combination of control source and degrees of freedom was represented by 8 short videos: 2 of them pronounced the numbers 0-9, 2 of them the numbers 10-99, 2 the names of the months, and 2 the days of the week. This makes 80 videos.

The test subjects were instructed to evaluate the harmony and naturalness of the connection between the visual and audio channels: score 5 for perfect articulation, score 3 for mistakes, and score 1 for a hardly recognizable connection. The results are interesting: after the second degree of freedom the evaluation is nearly constant, while the numerical error halves between the 2nd and 3rd degree. A possible explanation of this phenomenon is the simplicity of the head model used for scoring. The tongue and the teeth were not moved independently in the videos; the additional degrees of freedom were used only to approximate the mouth contour more precisely, which may have been precise enough already at lower degrees of freedom.


[Figure 6.3 plot: training and testing error (0.006 to 0.022) versus epochs (10^1 to 10^3, logarithmic scale), with separate train and test curves for 1 to 5 DoF.]

Figure 6.3: Training and testing error of the network during training. The error is the average distance between the weights obtained from the video data and the weights calculated from the Speex input.


Figure 6.4: Examples with the Hungarian word "Szeptember"; it is very close to the English "September", except that the last e is also open. Each figure shows the mouth contour over time. The original data is from a video frame sequence. The 2 DoF and 3 DoF figures are the results of the decomposition. The last picture is the voice-driven synthesis.


Higher target complexity makes the neural network converge more slowly, or not at all. But as we increase the degrees of freedom, the neural network's error decreases from the 2nd degree on.

The results of the opinion score test show that the best score/DoF ratio is at 2 DoF (Fig. 6.5), in fact at the highest numerical error. These results show that the neural network may be trained on details which are not very important to the test subjects. As the decomposition is based entirely and only on the mouth contour, those details may not be that important. Using correct teeth visibility or tongue movement may improve the results, but in this test we were unable to try this because of the lack of markers on these facial organs. This problem is in the decomposition phase, since the synthesized face does have these facial organs and controls them actively, but the control is inaccurate. If the decomposition were informed by more data, this could be corrected. Active shape modeling or other advanced techniques may improve the decomposition material. The main consequence of the subjective test is that two degrees of freedom can give sufficient quality for audiovisual speech, and the proposed method can give the control parameters at this quality from the voice signal.

6.1.10 Conclusion

The main challenge was the unusual representation of the visual speech. We can say our system used this representation successfully.

The presented method is efficient: the CPU cost is low, there is no network traffic overhead, the feature extraction of the voice is already performed by the voice compression, and the space complexity is scalable to the application. The feature is independent of the other clients and can be turned on without explicit support from the server or the other clients.

The quality of the mouth motion was measured by subjective evaluation; the proposed voice-driven facial motion shows sufficient quality for on-line games, significantly better than the one-dimensional jaw motion.

Let us note that the system does not contain any language-dependent components or models. The resulting system outperforms the baseline of the widely used energy-based interpolation of two visemes [52].


Figure 6.5: Subjective scores of the decomposition motion control and of the output of the neural network. There is a significant improvement from introducing a second degree of freedom. The method's scores follow the database's scores according to the complexity of the given degree of freedom.

6.2 Thesis

6.2.1 Novelty

Facial parameters are usually represented with PCA. This new representation is aware of the demands of the graphical designers. There were no publications before on the usability of this representation concerning ATVS database building or real-time synthesis.

6.2.2 Measurements

Subjective opinion scores were used to measure the resulting quality.

6.2.3 Consequences

Using Speex and the viseme combination representation, the resulting system is very easily embeddable.


Summary

In this proposed thesis I collected my contributions to the field of audio speech conversion to visual speech, especially direct conversion between the modalities.