
6 Visual speech in audio transmitting telepresence applications

Supporting an ATVS system with head models requires information on the representation of the ATVS system. In earlier results we used PCA parameters, and the face rendering parameters were based on measurements. If the system has to use head models made by graphical designers, the main component states of the head should be clearly formulated. Graphical designers cannot draw principal components, since these abstractions cannot be observed in themselves in nature; designers can, however, draw viseme states.

For audio preprocessing we used MFCC. Some applications include other audio preprocessing; in the case of audio transmission it is mostly Speex [47].

I will describe a method for easily enhancing audio-transmitting telepresence applications by reusing their internal Speex preprocessor, producing output that is capable of rendering visual speech from viseme states.

6.1 Method

In on-line cyberspaces there are artificial bodies which imitate realistic behavior controlled by remote users. An important aspect is the realistic facial motion of human-like characters according to the actual speech sounds. This chapter describes a memory- and CPU-efficient method of visual speech synthesis for on-line applications using a voice connection over the network. The method is real-time and can be activated on the receiver client without server support: only the coded speech signal needs to be sent, and the visual speech synthesis is the task of the receiving client. The animation rendering is supported by graphical accelerator devices, so the CPU load of the conversion is insignificant.



6.1.1 Introduction

Voice-driven visual speech synthesis has a growing popularity in cyber-telepresence applications. As of 2009 there are more and more video games on the market with the benefits of this technology.

The most popular use of visual speech synthesis is real-time rendering of pre-calculated facial animation. This meets all the requirements in an artificial world where the content of the voices is given by the designers, it is recorded with voice actors, and there is time to do all the calculations during production. Examples of this technology are the titles Oblivion and Fallout 3 from Bethesda Softworks [48], which use the MPEG-4 based FaceGen [13]. Although this approach is extendable to real-time applications by concatenative synthesis, we will see that it is not a really suitable solution.

In a real-time telepresence application the player activates the transmission; the client side records the voice in small chunks and sends it to the server, which forwards it to the given subset of the players: teammates or any characters nearby. During active voice transmission the visual feedback on the receiver client side is some visual effect on the character, like an icon, a light effect, or a basic or random facial motion. An example of basic facial motion is in Counter-Strike: Source [49], where the momentary voice energy is visualized by the movement of the jaw. We will use this approach as a baseline. Our solution is a replacement of this with improved quality, allowing even lip reading.

6.1.2 Overview

Real-time or pre-calculated motion control

In the case of production-time methods all of the audio content is available in advance. A typical example starts from a screenplay, and the voice recordings are based on the given text. There are solutions to extract a phoneme string from text and to synchronize this phoneme string to the recordings, like Magpie [50] for example. Voice-synchronized phoneme strings can be used to create a viseme string with visual co-articulation. The viseme is the basic unit of visual speech (Fig. 6.2), practically the visual consequence of the pronunciation of a phoneme. The viseme string with timing carries the visual information, and co-articulation methods have to form it into a natural visual flow. Viseme combinations were mapped for interactions such as domination or modification, and with this knowledge viseme pairs or longer subsequences are used for the synthesis.

Also, during production the speech signal is available as a whole sentence. This makes those methods usable which, for a given frame, use data from the voice of the following frames. This information is definitely important for precise facial motion [38, 6]. Real-time methods are not allowed to use long buffers because of the disturbing delay.

One of the real-time approaches uses an automatic speech recognition (ASR) system to extract the phoneme string from the voice [6]. The benefit of this approach is its compatibility with viseme string concatenator methods: ASR simply replaces the manually extracted, annotated phoneme string information.


Figure 6.1: Our real-time method: the visual speech animation parameters are calculated from the Speex coding parameters with a neural network.

The ASR system can be trained on usual speech databases without visual data. The drawback is the time and space complexity of the recognition and the propagation of the recognition errors caused by falsely categorized phonemes or words.

The other way is direct conversion, which is simpler and faster but usually less accurate because of the lack of language-dependent information.

6.1.3 Face model

For a speaking head model there are two requirements: the artists should be able to design the model easily, and it should have enough degrees of freedom. For example, in the game Counter-Strike: Source the mouth motion has one degree of freedom, the position of the jaw, and it is directly linked to the energy of the signal. Although this behaves in an obviously artificial way, it is numerically a fair approximation, since PCA (Principal Component Analysis) factorization of visual speech shows that 90% of the deviation is in the first principal component, which mainly captures the motion of the jaw [8]. In order to have a more sophisticated head model there should be more degrees of freedom, including the horizontal motion of the mouth boundary or more.

In our works we used PCA-based facial coordinates to represent a facial state. This representation has some nice properties, such as the mathematically proven maximum compression rate along linear bases for a given dimension count. Each state is expressed in an optimal basis calculated from a visual speech database. In this way 30-dimensional facial data can be compressed into 6 dimensions with only 2% error. When we visualize the calculated basis, the coordinates show motion components such as jaw motion, lip rounding, and so on. These are not visemes, as visemes are not guaranteed to be orthogonal to each other. We used guided PCA in this case, including the most important visemes as long as possible.
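To make this concrete, here is a minimal sketch of the compression step, assuming the mean face and an orthonormal basis have been precomputed from the visual speech database; the dimensions follow the 30-to-6 figures above, and all names and the data layout are illustrative.

```cpp
#include <array>

constexpr int kFullDim = 30;  // marker coordinates per facial state
constexpr int kPcaDim  = 6;   // retained principal components

using FullState = std::array<double, kFullDim>;
using PcaCoords = std::array<double, kPcaDim>;

// Project a facial state onto the PCA basis (rows of `basis` are
// orthonormal basis vectors, `mean` is the mean face, both precomputed
// from the visual speech database).
PcaCoords compress(const FullState& x, const FullState& mean,
                   const double basis[kPcaDim][kFullDim]) {
    PcaCoords c{};
    for (int i = 0; i < kPcaDim; ++i)
        for (int j = 0; j < kFullDim; ++j)
            c[i] += basis[i][j] * (x[j] - mean[j]);
    return c;
}

// Reconstruct the approximate facial state from the 6 coordinates;
// per the measurements above, this loses only about 2% on average.
FullState reconstruct(const PcaCoords& c, const FullState& mean,
                      const double basis[kPcaDim][kFullDim]) {
    FullState x = mean;
    for (int i = 0; i < kPcaDim; ++i)
        for (int j = 0; j < kFullDim; ++j)
            x[j] += c[i] * basis[i][j];
    return x;
}
```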

For a designer artist it is easier to build multiple model shapes in different phases than to build one model capable of parameter-dependent motion by implementing rules in the 3D framework's script language. The multiple shapes should be the clear states of typical mouth phases, usually the visemes, since these phases are easy to capture by example. A designer would hardly create a model in a theoretical state given by factorization methods.


Therefore we need facial animation control based on face states, so that the designer can work with examples. The control of the facial animation can be the weights of the drawn shapes. Generally it is not true that every facial state can be expressed from any set of visemes, but there is an approximation of the states for a given viseme set, and depending on the size of this set, and so on the degrees of freedom, any level of accuracy can be reached (Table 6.1). This approach may use more degrees of freedom than the PCA-based approach for the same quality, since PCA is optimal in this respect.

The rendering of the face is efficient, as the graphical interfaces usually provide hardware-accelerated vertex blending. There are more sophisticated approaches using volume-conserving transformations [51] with a slight increase in time complexity; these methods can be used to render our approach as well. Support for features like wrinkling skin is outside our interest.

6.1.4 Viseme based decomposition

The video data is from a video recording of a talking person with a fixed field of view. The head of the person was fixed to the chair to eliminate the motion of the whole head. The face of the person was prepared with marker points which were tracked automatically and corrected manually [8]. The position of the nose was used as the origin, so every frame was translated to a common frame. The automatic tracking was based on color-sensitive highlight tracking with automatic quality feedback to help the manual corrections. There were 15 markers, placed on a subset of the MPEG-4 feature points. The marker tracker results in a vector stream in 2D pixel space. This representation is good for measurement, because no special equipment is needed for the recording, and it is also relatively good for estimating quality, since the generated animation will give the same data in the best scenario. To achieve interchangeable metrics, the pixel unit must be eliminated. This elimination is done by using the distribution of the given marker position as a reference for the position error. In this case the pixel units are eliminated from the result by transforming to a relative scale.

\[
E(G) = \frac{1}{N f} \sum_{i=1}^{N} \sum_{j=1}^{f} \frac{\left| G_i^j - S_i^j \right|}{\sigma(S_i)} \tag{6.1}
\]

where N is the dimensionality of the visual representation, f is the total number of frames, G is the generated signal compared against the signal S, and E is the estimated error value. S can be any linear representation of the facial state, given in pixels, vertices, or MPEG-4 FAP values.
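A direct transcription of Equation 6.1 could look like the sketch below. It assumes frame-major storage for G and S and a precomputed per-dimension standard deviation of the reference signal; the absolute difference follows the "average distance" reading of the error, and all names are illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Transcription of Equation 6.1. G and S are frame-major:
// G[j][i] is dimension i of frame j. sigma[i] is the standard
// deviation of dimension i of the reference signal S.
double estimatedError(const std::vector<std::vector<double>>& G,
                      const std::vector<std::vector<double>>& S,
                      const std::vector<double>& sigma) {
    const std::size_t f = G.size();      // total number of frames
    const std::size_t N = sigma.size();  // dimensionality of the representation
    double sum = 0.0;
    for (std::size_t j = 0; j < f; ++j)
        for (std::size_t i = 0; i < N; ++i)
            sum += std::fabs(G[j][i] - S[j][i]) / sigma[i];
    return sum / (double(N) * double(f));
}
```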

Every facial state is expressed as a weighted sum of the selected viseme state sets. The decomposition algorithm is a simple optimization of the weight vectors of the viseme elements, resulting in minimal errors. The visemes are given in pixel space. Every frame of the video is processed independently in the optimization. We used a partial gradient method with a convexity constraint to optimize the weights, where the gradient was based on the distance between the original and the weighted viseme sum (Equation 6.1).


Table 6.1: Errors as a function of the number of visemes used in composition, on important feature points. Visemes are written in the corresponding phoneme codes, except "closed" and S*, which is not a standalone dominant viseme; it is used at the beginning of the word.

6 visemes (closed, 2, I, O, E, S*): 3.48% error

The constraint is a sufficient but not necessary condition to avoid unnatural results such as a too big mouth or head: no negative weights are allowed, and the sum of the weights is one. In this case a step in the partial gradient direction means a larger change in that direction and a small change in the remaining directions to balance the sum. The approximation is accelerated and smoothed by choosing the starting weight vector from the last result.
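A minimal sketch of one frame of this constrained optimization is given below, assuming a squared pixel-space distance. The step size, iteration count, and the clamp-and-renormalize projection are illustrative simplifications, not the exact procedure of the thesis; the warm start from the previous frame follows the description above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One frame of the viseme decomposition: minimize the distance between
// `target` and the weighted viseme sum, keeping the weights non-negative
// and summing to one. Warm-start `w` with the previous frame's result.
std::vector<double> decomposeFrame(
        const std::vector<std::vector<double>>& visemes, // [viseme][dim]
        const std::vector<double>& target,               // [dim]
        std::vector<double> w,                           // warm start
        int iters = 100, double step = 0.01) {
    const std::size_t k = visemes.size(), dim = target.size();
    for (int it = 0; it < iters; ++it) {
        // Residual of the current convex combination.
        std::vector<double> r(target);
        for (std::size_t v = 0; v < k; ++v)
            for (std::size_t d = 0; d < dim; ++d)
                r[d] -= w[v] * visemes[v][d];
        // Gradient step on each weight, clamped to be non-negative.
        for (std::size_t v = 0; v < k; ++v) {
            double g = 0.0;
            for (std::size_t d = 0; d < dim; ++d)
                g += r[d] * visemes[v][d];
            w[v] = std::max(0.0, w[v] + step * g);
        }
        // Re-normalize so the weights sum to one (convexity constraint).
        double s = 0.0;
        for (double x : w) s += x;
        if (s > 0.0) for (double& x : w) x /= s;
    }
    return w;
}
```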

The state G can be expressed as a convex sum of the viseme states V, which can be any linear representation, such as pixel coordinates or 3D vertex coordinates:

\[
G = \sum_{k} w_k V_k \tag{6.2}
\]

\[
w_k \ge 0, \qquad \sum_{k} w_k = 1 \tag{6.3}
\]

The convexity guarantees that the blending is independent of the coordinate system. If the designer uses unnormalized vertex coordinates, a weighted sum whose weights sum to more or less than one can result in translation and magnification of the head.

The results of this simple approximation are acceptable. The quality is estimated by the pixel errors of the important facial feature points. The selection of important points is based on deviation: those feature points whose deviation is above the average are chosen.

The head model used in the subjective tests is three-dimensional, while this calculation is based on two-dimensional similarities, so the phase decomposition rests on the assumption that two-dimensional (frontal view) similarity induces three-dimensional similarity. This assumption is numerically reasonable with projection.

Note that the representation quality is scalable by setting the viseme count. This makes the resulting method scalable on the client side.


Figure 6.2: Visemes are the basic units of visual speech. These are the visemes we used for the subjective opinion score tests, in this (row-major) order.


6.1.5 Voice representation

Every voice processing method needs to extract useful information from the signal. Algorithms which directly use the sound pressure signal are called time domain methods; those which use the Fourier transform or other frequency-related filter banks are called frequency domain methods; and those which use some (lossy) compressed input are called compressed domain methods.

The applications where voice-driven facial animation matters use voice transmission, and voice transmission systems use lossy compression methods to minimize the network load. Therefore an efficient voice-driven visual speech synthesizer should be a compressed domain method.

A speech coder aims to achieve the best voice quality with reasonably sized data packets. This can be treated as a feature extraction method. The question is: what distance function can be used on the given representation? Is there an appropriate metric that a learning system can approximate?

6.1.6 Speex coding

One of the most popular speech coders for this purpose is Speex [47]. Speex uses LSP, a member of the linear prediction coding family. Linear prediction uses a vector of scalars which predicts the next sample from the previous samples by linear combination:

\[
x_n = \sum_{i=1}^{N} a_i \, x_{n-i} \tag{6.4}
\]

where N is the size of the prediction vector. The optimal predictor coefficient vector for a given x can be calculated by the Levinson-Durbin algorithm. This is a short representation, but it is not suitable for quantization or for linear operations such as linear interpolation; consequently it is not directly used for voice transmission or facial animation conversion. Hence Speex uses LSP, which is a special representation of the same information but amenable to linear operations; for example, the LSP values are linearly interpolated between the compressed frames of Speex.
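For reference, the Levinson-Durbin recursion mentioned above can be sketched in its textbook form as follows; it computes the predictor coefficients from the autocorrelation of the windowed signal. Variable names are conventional and not taken from the Speex sources.

```cpp
#include <vector>

// Levinson-Durbin recursion: solves for the linear prediction
// coefficients a[1..N] given the autocorrelation sequence r[0..N]
// of the windowed signal, so that x_n ~ sum_i a[i] * x_{n-i}.
std::vector<double> levinsonDurbin(const std::vector<double>& r, int N) {
    std::vector<double> a(N + 1, 0.0), prev(N + 1, 0.0);
    double err = r[0];
    for (int i = 1; i <= N; ++i) {
        double k = r[i];                      // reflection coefficient
        for (int j = 1; j < i; ++j) k -= a[j] * r[i - j];
        k /= err;
        prev = a;
        a[i] = k;
        for (int j = 1; j < i; ++j) a[j] = prev[j] - k * prev[i - j];
        err *= (1.0 - k * k);                 // prediction error update
    }
    return a;                                 // a[0] is unused
}
```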

For LSP coding, instead of storing the predictor vector a, we treat it as a polynomial and store its roots. The roots are guaranteed to be inside the unit circle of the complex plane. Finding the roots would require a two-dimensional search, so to avoid this we use a pair of polynomials which are guaranteed to have all their roots on the unit circle, and whose mean is the original polynomial, so a one-dimensional search is enough.

\[
P(z) = a(z) + z^{-(N+1)}\, a(z^{-1}), \qquad Q(z) = a(z) - z^{-(N+1)}\, a(z^{-1}) \tag{6.5}
\]

\[
\mathrm{LSP} = \left\{ z \in \mathbb{C} : P(z)\, Q(z) = 0 \right\} \tag{6.6}
\]


This makes LSP more robust to quantization and interpolation than the predictor vector. Interestingly, the P and Q polynomials describe the vocal tract with the glottis open and closed, which connects them to the topic of audiovisual speech synthesis.

Lossy compression methods quantize the values of a carefully chosen representation. LSP is a compact and robust representation, and Speex uses vector quantization to compress these values. We modified the Speex decoding process to export the uncompressed LSP values and the energy. This adds only 11 assignments and multiplications for scaling as extra computational cost.

6.1.7 Neural network training

The data is from an audiovisual recording of a professional lip-speaker. The recording contains 4250 frames. The content is intended for testing direct voice to visual speech conversion for deaf people; it contains numbers, months, etc. The language is Hungarian. The network is a simple feedforward error-backpropagation network with one hidden layer.

Audio

The audio recording is originally 48 kHz, and it is downsampled to 8 kHz for Speex. We used the modified Speex decoder to extract LSP and gain values as the input for training the neural networks. There are values for each 20 ms window. LSP values lie in [0, π], while the neural network uses the [−1, 1] interval, so scaling was applied.
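Written out, the scaling is the linear map from [0, π] onto [−1, 1]:

\[
\hat{\ell}_i = \frac{2\,\ell_i}{\pi} - 1, \qquad \ell_i \in [0, \pi] \;\Rightarrow\; \hat{\ell}_i \in [-1, 1]
\]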

Video

The target of the neural network is the viseme weight vector representing the facial state. As the original recording is 25 frames per second and the audio data from Speex uses 20 ms windows, the video data was interpolated from 40 ms to 20 ms frame intervals. We used linear interpolation, as it does not violate convexity. The decomposition weight values are in the range [0, 1], which lies within the neural network's [−1, 1] interval, so no scaling was applied.
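Explicitly, each inserted 20 ms frame is a convex combination of the two neighboring 40 ms frames, so the interpolated weight vectors stay non-negative and keep a unit sum:

\[
w(t) = (1 - \alpha)\, w_k + \alpha\, w_{k+1}, \qquad \alpha = \frac{t - t_k}{t_{k+1} - t_k} \in [0, 1]
\]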

Neural network usage

The resulting network is intended to be used directly in the host application. The trained network weights can be exported as a static function of a programming language, for example C++. This source code can be compiled into the client. The function is called with the values exported from the modified Speex codec, and the returned values are applied directly to the renderer. With this approach the runtime overhead is minimal: no file reading or data structures are needed. The generated source code can be created at speech research laboratories, and the application developers just use the code.
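The sketch below shows what such a generated function could look like. The layer sizes, names, and the tanh activation are illustrative assumptions; the weight arrays, zero-filled here only so the sketch compiles, would be emitted by the training tool.

```cpp
#include <cmath>

// Illustrative sizes: 10 LSP values + gain in, a few viseme weights out.
constexpr int kIn = 11, kHidden = 16, kOut = 6;

// Emitted as constants by the trainer; zero-filled placeholders here.
const float kW1[kHidden][kIn] = {}, kB1[kHidden] = {};
const float kW2[kOut][kHidden] = {}, kB2[kOut] = {};

// in:  scaled LSP and gain values from the modified Speex decoder
// out: viseme weights, applied directly to the renderer
void speexToVisemes(const float in[kIn], float out[kOut]) {
    float h[kHidden];
    for (int i = 0; i < kHidden; ++i) {
        float s = kB1[i];
        for (int j = 0; j < kIn; ++j) s += kW1[i][j] * in[j];
        h[i] = std::tanh(s);  // hidden layer activation
    }
    for (int i = 0; i < kOut; ++i) {
        float s = kB2[i];
        for (int j = 0; j < kHidden; ++j) s += kW2[i][j] * h[j];
        out[i] = s;  // weight of the i-th viseme shape
    }
}
```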


6.1.8 Implementation issues

The method can be implemented as a client-side feature, on the receiver side. The user may turn the method on and off, since the calculations are performed on the receiving client's CPU. There is no extra payload on the network traffic.

The CPU cost of the calculation is 200-400 multiplications, depending on the hidden layer size and the degrees of freedom. The space cost of the feature is the multiple shapes of the head models, which depends on how sophisticated the head models of the given application are. The space cost is scalable by setting the viseme set: the more head model shapes, the better the approximation of the real mouth motion.

The head models can be stored in the video accelerator device's memory and can be manipulated through graphical interfaces such as OpenGL or Direct3D. The vertex blending (weighted vector sum) can be calculated on the accelerator device; it is highly parallel since the vertices are independent.
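As a CPU reference for this blending step, the sketch below assumes all shapes share one vertex ordering; all names are illustrative. On the GPU the same per-vertex weighted sum would run in a vertex shader.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// CPU reference of the vertex blending: each output vertex is the
// weighted sum of the corresponding vertices of the viseme shapes.
// Convex weights keep the head in place and in scale.
std::vector<Vec3> blendShapes(const std::vector<std::vector<Vec3>>& shapes,
                              const std::vector<float>& weights) {
    const std::size_t nVerts = shapes.front().size();
    std::vector<Vec3> out(nVerts, Vec3{0, 0, 0});
    for (std::size_t s = 0; s < shapes.size(); ++s)
        for (std::size_t v = 0; v < nVerts; ++v) {
            out[v].x += weights[s] * shapes[s][v].x;
            out[v].y += weights[s] * shapes[s][v].y;
            out[v].z += weights[s] * shapes[s][v].z;
        }
    return out;
}
```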

6.1.9 Results

The training and testing sets were separated, and during the first 1,000,000 epochs (training cycles) the error on the testing set still decreased (Fig. 6.3). Depending on the degrees of freedom, the results are 1-1.5% average error. Our former measurements gave sufficient intelligibility results at this level of numeric error. This shows that a usable training error level can be reached before overtraining, even with relatively small databases.

The details of the trained system's response can be seen in Fig. 6.4. The main motion flow is reproduced, and there are small glitches at bilabial nasals (lips do not close fully) and plosives (visible burst frame). Most of these glitches could be avoided using a longer buffer, but that would cause delay in the response.

A subjective opinion score test was done to evaluate the voice-based facial animation with short videos. Half of the test material was a face picture controlled by the decomposed data, and the other half by facial animation control parameters computed by the neural network from the original speech sounds. The opinion score test included control parameters of 1 to 5 degrees of freedom. Each combination of control source and degrees of freedom was represented by 8 short videos: 2 of them pronounced the numbers 0-9, 2 of them the numbers 10-99, 2 the names of the months, and 2 the days of the week. This makes 80 videos.

The test subjects were instructed to evaluate the harmony and naturalness of the connection between the visual and audio channels: score 5 for perfect articulation, score 3 for mistakes, and score 1 for a hardly recognizable connection. The results are interesting: after the second degree of freedom the evaluation is nearly constant, while the numerical error halves between the 2nd and 3rd degree. A possible explanation of this phenomenon is the simplicity of the head model used for scoring. The tongue and the teeth were not moved independently in the videos; the additional degrees of freedom were used only to approximate the mouth contour more precisely, which may have been precise enough already at lower degrees of freedom.


[Figure 6.3 plot: training and testing error (0.006 to 0.022) versus epochs (10^1 to 10^3, logarithmic scale), with separate train and test curves for 1 to 5 DoF.]

Figure 6.3: Training and testing error of the network during training. The error is the average distance between the weights obtained from the video data and the weights calculated from the Speex input.


Figure 6.4: Examples with the Hungarian word "Szeptember"; it is very close to the English "September", except that the last e is also open. Each figure shows the mouth contour over time. The original data is from a video frame sequence. The 2 DoF and 3 DoF figures are the results of the decomposition. The last picture is the voice-driven synthesis.


Higher target complexity makes the neural network converge more slowly, or not at all. But as we increase the degrees of freedom, the neural network's error decreases from the 2nd degree on.

The results of the opinion score test show that the best score/DoF ratio is at 2 DoF (Fig. 6.5), in fact at the highest numerical error. These results show that the neural network may be trained on details which are not very important to the test subjects. As the decomposition is based entirely and only on the mouth contour, those details may not be that important. Using correct teeth visibility or tongue movement may improve the results, but in this test we were unable to try this because of the lack of markers on these facial organs. This problem is in the decomposition phase, since the synthesized face does have these facial organs and controls them actively, but the control is inaccurate. If the decomposition were informed by more data, this could be corrected. Active shape modeling or other advanced techniques may improve the decomposition material. The main consequence of the subjective test is that two degrees of freedom can give sufficient quality for audiovisual speech, and the proposed method can give the control parameters at this quality from the voice signal.

6.1.10 Conclusion

The main challenge was the unusual representation of the visual speech. We can say our system used this representation successfully.

The presented method is efficient: the CPU cost is low, there is no network traffic overhead, the feature extraction of the voice is already performed by the voice compression, and the space complexity is scalable to the application. The feature is independent of the other clients and can be turned on without explicit support from the server or the other clients.

The quality of the mouth motion was measured by subjective evaluation; the proposed voice-driven facial motion shows sufficient quality for on-line games, significantly better than the one-dimensional jaw motion.

Let us note that the system does not contain any language-dependent components or models. The resulting system outperforms the baseline of the widely used energy-based interpolation of two visemes [52].


Figure 6.5: Subjective scores of the decomposition motion control and of the output of the neural network. There is a significant improvement from introducing a second degree of freedom. The method's scores follow the database's scores according to the complexity of the given degree of freedom.

6.2 Thesis

6.2.1 Novelty

Facial parameters are usually represented with PCA. This new representation is aware of the demands of the graphical designers. There were no publications before on the usability of this representation concerning ATVS database building or real-time synthesis.

6.2.2 Measurements

Subjective opinion scores were used to measure the resulting quality.

6.2.3 Consequences

Using Speex and the viseme combination representation, the resulting system is very easily embeddable.


Summary

In this proposed thesis I collected my contributions to the field of audio speech conversion to visual speech, especially direct conversion between the modalities.