
Chapter 3

Naturalness of direct conversion

In this chapter I discuss the measurement of the naturalness of synthetic visual speech, and the comparison of different AV mapping approaches.

3.1 Method

A comparative study of audio-to-visual speech conversion is described in this chapter. The direct, feature-based conversion approach is compared to various indirect, ASR-based solutions. The already detailed base system was used as the direct conversion. The ASR-based solutions are the most sophisticated systems currently available for Hungarian. The methods are tested in the same environment in terms of audio pre-processing and facial motion visualization. Subjective opinion scores show that, with respect to naturalness, direct conversion performs well. Conversely, with respect to intelligibility, ASR-based systems perform better.

The thesis about the results of the comparison is important because no AV mapping comparisons had been done before with the novel training database of a professional lip-speaker.

3.1.1 Introduction

A difficulty that arises in comparing the different approaches is that they are usually developed and tested independently by the respective research groups. Different metrics are used, e.g. intelligibility tests and/or opinion scores, and different data and viewers are applied [28]. In this chapter I describe a comparative evaluation of different AV mapping approaches within the same workflow, see Figure 3.1. The performance of each is measured in terms of intelligibility, where lip-readability is measured, and naturalness, where a comparison with real visual speech is made.

3.1.2 Audio-to-visual Conversion

The performance of five different approaches will be evaluated. These are summarized as follows:



Figure 3.1: Multiple conversion methods were tested in the same environment.


Figure 3.2: Structure of direct conversion.

- A reference based on natural facial motion.

- A direct conversion system.

- An ASR-based system that linearly interpolates phonemic/visemic targets.

- An informed ASR-based approach that has access to the vocabulary of the test material (IASR).

- An uninformed ASR (UASR) that does not have access to the text vocabulary.

These are described in more detail in the following sections.

Direct conversion

We used our base system, with a database of a professional lip-speaker. The length of the recorded speech was 4250 frames.


ASR-based conversion

For the ASR-based approaches a Weighted Finite State Transducer, Hidden Markov Model (WFST-HMM) decoder is used. Specifically, a system known as VOXerver [29] is used, which can run in one of two modes: informed, which exploits knowledge of the vocabulary of the test data, and uninformed, which does not. Incoming speech is converted to MFCCs, after which blind channel equalization is used to reduce linear distortion in the cepstral domain [30]. Speaker-independent, cross-word, decision-tree based triphone acoustic models are applied, trained previously on the MRBA Hungarian speech database [31], a standardized, phonetically balanced Hungarian speech database developed at the Budapest University of Technology and Economics.

The uninformed ASR system uses a phoneme-bigram phonotactic model to constrain the decoding process. The phoneme-bigram probabilities were estimated from the MRBA database. In the informed ASR system a zerogram word language model is used with a vocabulary size of 120 words. Word pronunciations were determined automatically as described in [32].

In both types of speech recognition approaches the WFST-HMM recognition network was constructed offline using the AT&T FSM toolkit [33]. In the case of the informed system, phoneme labels were projected to the output of the transducer instead of word labels. The precision of the segmentation is 10 ms.

Viseme interpolation

To compare the direct and indirect audio-to-visual conversion systems, a standard approach for generating visual parameters is to first convert a phoneme to its equivalent viseme via a lookup table, then linearly interpolate the viseme targets. This approach to synthesizing facial motion is oversimplified because coarticulation effects are ignored, but it does provide a baseline on expected performance (worst-case scenario).
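A minimal sketch of this baseline, assuming timed phoneme segments from the ASR and a hand-made viseme lookup table (the table entries below are placeholders, not the actual Hungarian viseme inventory):

```python
import numpy as np

# Hypothetical lookup table; the real Hungarian viseme set is not listed here.
PHONEME_TO_VISEME = {"O": "open", "t": "dental", "m": "bilabial"}

def interpolate_visemes(segments, viseme_targets, fps=25):
    """Linearly interpolate FAP target vectors between consecutive viseme keys.

    segments: list of (phoneme, start_s, end_s) tuples from the ASR output.
    viseme_targets: dict mapping viseme name -> FAP target vector (np.ndarray).
    Returns one FAP vector per video frame.
    """
    frames = []
    for (ph_a, t0, _), (ph_b, t1, _) in zip(segments, segments[1:]):
        v_a = viseme_targets[PHONEME_TO_VISEME[ph_a]]
        v_b = viseme_targets[PHONEME_TO_VISEME[ph_b]]
        n = max(1, int(round((t1 - t0) * fps)))
        for i in range(n):
            alpha = i / n            # plain linear blend, no coarticulation
            frames.append((1 - alpha) * v_a + alpha * v_b)
    return np.vstack(frames)
```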

Modular ATVS

To account for coarticulation effects, a more sophisticated interpolation scheme is required. In particular, the relative dominance of neighboring speech segments over the articulators is needed. Speech segments can be classified as dominant, uncertain or mixed according to the level of influence exerted on the local neighborhood. To learn the dominance functions, an ellipsoid is fitted to the lips of speakers in a video sequence articulating Hungarian triphones. To aid the fitting, the speakers wear a distinctly colored lipstick. Dominance functions are estimated from the variance of the visual data in a given phonetic neighborhood set. The learned dominance functions are used to interpolate between the visual targets derived from the ASR output [34]. We use the implementation of László Czap and János Mátyás here, which produces Poser script. FAPs are extracted from this format by the same workflow as from an original recording. A generic sketch of dominance-weighted blending is shown below.
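The implementation itself is not reproduced here; the following is only a generic sketch of dominance-weighted target blending in the spirit of dominance-function models. The Gaussian shape of the dominance curves is an assumption for illustration (the actual functions were learned from lip-ellipse variance as described above):

```python
import numpy as np

def dominance_weighted_frame(t, targets, centers, peaks, widths):
    """Blend viseme FAP targets at time t using dominance functions.

    targets: (K, D) array of FAP target vectors for K speech segments.
    centers: (K,) segment-center times in seconds.
    peaks:   (K,) peak dominance of each segment (dominant/uncertain/mixed).
    widths:  (K,) temporal spread of each segment's dominance function.
    """
    w = peaks * np.exp(-0.5 * ((t - centers) / widths) ** 2)  # assumed Gaussian shape
    return (w / w.sum()) @ targets
```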


Figure 3.3: Modular ATVS consists of an ASR subsystem and a text-to-visual-speech subsystem.

Rendering Module

The visualization of the output of the ATVS methods is common to all approaches. The output from the ATVS modules is facial animation parameters (FAPs), which are applied to a common head model for all approaches. Note that although better facial descriptors than MPEG-4 are available, MPEG-4 is used here because our motion capture system does not provide more detail than this. The rendered video sequences are created from these FAP sequences using the Avisynth [35] 3D face renderer. As the main components of the framework are common between the different approaches, any differences are due to the differences in the AV mapping methods. Actual frames are shown in Fig. 3.5.

3.1.3 Evaluation

Implementation-specific, non-critical behavior (e.g. articulation amplitude) should be normalized to ensure that the comparison is between the essential qualities of the methods. To discover these differences, a preliminary test was done.

Preliminary test

To tune the parameters of the systems, 7 videos were generated by each of the five mapping methods, and some sequences were re-synthesized from the original facial motion data. All sequences started and ended with a closed mouth, and each contained between 2-4 words. The speaker who participated in all of the tests was not one of those who


Table 3.1: Results of preliminary tests used to tune the system parameters. Shown are the average and standard deviation of scores.

Method    Average score    STD

were involved in the training of the audio-to-visual mapping. The videos were presented in a randomized order to 34 viewers, who were asked to rate the quality of the systems using an opinion score (1–5). The results are shown in Table 3.1.

The results were unexpected: the IASR, which uses a more sophisticated coarticulation model, was expected to be one of the best performing systems. Closer investigation of the lower scores showed that the reason was the poorer audiovisual synchrony of IASR compared to UASR. The reason for this phenomenon is the difference in the mechanism of the informed and the uninformed speech recognition process. During informed recognition the timing information is produced as a consequence of the alignment of the correct phonemes to the signal, which forces the segment boundaries to fit the given phonetic information. The uninformed recognition may miscategorize a phoneme, but the acoustic changes drive the segment boundaries, so the resulting segmentation is closer to the acoustically reasonable one than the phonetically driven segmentation.

A qualitative difference between the direct and indirect approaches is the degree of mouth opening: the direct approach tended to open the mouth on average 30% more than the indirect approaches. Consequently, to bring the systems into the same dynamic range, the mouth opening for the direct mapping was damped by 30%. The synchrony of the ASR-based approaches was checked for systematic errors (constant or linearly increasing delays) using cross-correlation of locally time-shifted windows, but no systematic patterns of errors were detected.
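As a rough illustration of that synchrony check, the best cross-correlation lag can be computed per local window of a visual trajectory (for example jaw opening); a consistently non-zero or drifting best lag would indicate a constant or linearly increasing delay. Function and parameter names below are illustrative, not taken from the thesis:

```python
import numpy as np

def best_lag(reference, candidate, max_lag=12):
    """Frame lag within ±max_lag that maximizes correlation with the reference."""
    lags = np.arange(-max_lag, max_lag + 1)
    scores = [np.corrcoef(reference[max_lag:-max_lag],
                          np.roll(candidate, lag)[max_lag:-max_lag])[0, 1]
              for lag in lags]
    return lags[int(np.argmax(scores))]

# Per-window lags over the utterance, e.g. 3 s (75-frame) windows at 25 fps:
# lags = [best_lag(ref[i:i + 75], asr[i:i + 75]) for i in range(0, len(ref) - 75, 75)]
```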

3.1.4 Results

ASR subsystem

The quality of the ASR-based approach is affected by the recognized phoneme string. This is typically 100% correct for the informed system, as the test set consists only of a small number of words (months of the year, days of the week, and numbers under 100), whilst the uninformed system has a typical error rate of 25.21%. Despite this, the ATVS using this input performs surprisingly well. The likely reason is the pattern of confusions: often phonemes that are confused acoustically appear visually similar on the lips. A second factor that affects the performance of the ASR-based approaches is the precision of the segmentation. Generally the uninformed systems are more precise on


Figure 3.4: Trajectory plot of the different methods for the word "Hatvanhárom" (hOtvOnha:rom). Jaw opening and lip opening width are shown. Note that the speaker did not pronounce the utterance perfectly, and the informed system attempts to force a match with the correctly recognized word. This leads to time alignment problems.

average than the informed systems. The precision of the segmentation can severely impact the subjective opinion scores. We therefore first attempt to quantify these likely sources of error.

The informed recognition system is similar in nature to forced alignment in standard ASR tasks. For each utterance the recognizer is run in forced alignment mode for all of the vocabulary entries. The main difference between the informed and the uninformed recognition process is the different Markov state graphs used for recognition. The informed system is a zerogram without loopback, while the uninformed graph is a bigram model graph where the probabilities of the connections depend on language statistics.

While matching the extracted features with the Markovian states, the differences are accumulated in both scenarios. However, the uninformed system allows for different phonemes outside of the vocabulary to minimize the accumulated error. For the informed system only the most likely sequence is allowed, which can distort the segmentation; see Figure 3.4 for an example where the speaker mispronounces the word "Hatvanhárom" (hOtvOnha:rom, "63" in Hungarian). The (mis)segmentation of OtvO means the IASR ATVS system opens the mouth after the onset of the vowel. Human perception is sensitive to this error, and so this severely impacts the perceived quality. Without forcing the vocabulary, a system may ignore one of the consonants but open the mouth at the correct time.

Note that the generalization of this phenomenon is outside the scope of this work. We have demonstrated that this is a problem with certain implementations of HMM-based ASR. Alternative, more robust implementations might alleviate these problems.


Table 3.2: Results of opinion scores, average and standard deviation.

Method                   Average score   STD
Original facial motion   3.73            1.01
Direct conversion        3.58            0.97
UASR                     3.43            1.08
Linear interpolation     2.73            1.12
IASR                     2.67            1.29

Subjective opinion scores

The test setup is similar to the preliminary test described previously to tune the system. However, 58 viewers are used, and only a quantitative opinion survey was made, on a scale of 1 (bad, very artificial) to 5 (real speech).

The results of the opinion score test are shown in Table 3.2. The advantage of direct conversion over UASR is on the edge of significance with p = 0.0512, as is the difference between the original speech and the direct conversion with p = 0.06, but UASR is significantly worse than the original speech with p = 0.00029. The results, compared to the preliminary test, also show that with respect to naturalness the excessive articulation is not significant. The advantage of correct timing over a correct phoneme string is also significant.
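The thesis does not state which statistical test produced these p-values; purely as a hedged illustration, such a pairwise comparison of per-rating score samples could be carried out with an independent-samples test, for example:

```python
from scipy import stats

def compare(scores_a, scores_b):
    """Two-sided Welch t-test between two opinion-score samples.

    scores_a, scores_b are hypothetical arrays of individual ratings;
    Welch's t-test is only one possible choice of test here.
    """
    _, p = stats.ttest_ind(scores_a, scores_b, equal_var=False)
    return p
```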

Note that the linear interpolation system exploits better quality ASR results, but still performs significantly worse than the average of the other ASR-based approaches. This shows the importance of correctly handling viseme dominance and viseme neighborhood sensitivity in ASR-based ATVS systems.

Intelligibility

Intelligibility was measured with a test of recognition of video sequences without sound. This is not the popular Modified Rhyme Test [36], but for our purposes with hearing-impaired viewers it is more relevant, since keyword spotting is the most common lip-reading task. The 58 test subjects had to guess which word was said from a given set of 5 words of the same category. The categories were numbers, names of months and the days of the week. All the words were said twice. The sets were intervals to eliminate the memory test from the task (for example "2", "3", "4", "5", "6" can be a set). This task models the situation of hearing-impaired users or a very noisy environment where an ATVS system can be used. It is assumed that the context is known, so keyword spotting is the closest task to the problem.

The performance of the audio-to-visual speech conversion methods is reversed in this task compared to naturalness. The main result here is the dominance of the ASR-based approaches (Table 3.3), and the insignificance of the difference between the informed and uninformed ATVS results (p = 0.43) in this test, which may deserve further investigation. Note that as synchrony is not an issue without voice, the IASR is the best.


Table 3.3: Results of recognition tests, average and standard deviation of success rate in percent. A random pick would give 20%.

Method              Precision   STD
IASR                61%         20%
UASR                57%         22%
Original motion     53%         18%
Cartoon             44%         11%
Direct conversion   36%         27%

Table 3.4: Comparison to the results of Öhman and Salvi [27], an intelligibility test of HMM- and rule-based systems. The intelligibility of corresponding methods is similar.

Methods       Prec. (ours)   Prec. [27]
IASR/Ideal    61%            64%
UASR/HMM      57%            54%
Direct/ANN    36%            34%

As a comparison with [27], where intelligibility is tested similarly: their manually tuned, optimal rule-based facial parameters are close to our IASR, since there was no recognition error, without voice the time alignment quality is not important, and our TTVS is rule-based. Their HMM test is similar to our UASR, because both are without vocabulary, both target a time-aligned phoneme string to be converted to facial parameters, and our ASR is HMM-based. Their ANN system is very close to our direct conversion except for the training set: theirs is audio from a standard speech database and rule-based calculated trajectories as video data, while our system is trained on an actual recording of a professional lip-speaker. However, the results concerning intelligibility are close to each other, see Table 3.4. This is a validation of the results, since the corresponding measurements are close to each other. It is important to note that [27] tests only intelligibility, and only between three of our methods, so our measurement is broader.

3.1.5 Conclusion

I presented a comparative study of audio-to-visual speech conversion methods. We have presented a comparison of our direct conversion system with conceptually different conversion solutions. A subset of the results correlates with already published results, validating the approach of the comparison.

We observe the higher importance of synchrony over phoneme precision in an ASR-based ATVS system. There are publications on the high impact of correct timing in different aspects [34, 37, 38], but our results show explicitly that more accurate timing achieves much better subjective evaluation than a more accurate phoneme sequence. Also, we have shown that in terms of subjective naturalness evaluation, direct conversion (trained on professional lip-speaker articulation) is a method which produces the highest opinion score, 95.9% of that of an original facial motion recording, with lower computational complexity than ASR-based solutions.

For tasks where intelligibility is important (support for the hearing impaired, visual information in noisy environments), modular ATVS is the best approach among those presented. Our mission of aiding hearing-impaired people calls upon us to consider using ASR-based components. For naturalness (animation, entertainment applications), direct conversion is a good choice. For both aspects, UASR gives relatively good but not outstanding results.

3.1.6 Technical details

Marker tracking was done for MPEG-4 FPs 8.8, 8.4, 8.6, 8.1, 8.5, 8.3, 8.7, 8.2, 5.2, 9.2, 9.3, 9.1, 5.1, 2.10, 2.1. During synthesis, all FAPs (MPEG-4 Facial Animation Parameters) connected to these FPs were used except depth information:

• raise_r_cornerlip, lower_t_midlip_o

• raise_b_midlip_o

• raise_r_cornerlip_o

The inner lip contour is estimated from the outer markers.

Yellow paint was used to mark the FP locations on the face of the recorded lip-speaker. The video recording is 576i PAL (576x720 pixels, 25 frames/sec, 24 bits/pixel). The audio recording is mono, 48 kHz, 16 bit, made in a silent room. Further conversions depended on the actual method.

Marker tracking was based on color matching and intensity localization from frame to frame, and the location was identified by its region. In overlapping regions the closest location on the previous frame was used to identify the marker. A frame with a neutral face was selected to serve as the reference for FAPU measurement. The marker on the nose is used as a reference to eliminate head motion.
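A minimal sketch of how such normalization can be expressed, assuming per-frame pixel coordinates of a tracked marker and the nose marker, the selected neutral frame, and a FAPU value measured on that neutral frame (all names are illustrative):

```python
import numpy as np

def marker_to_fap_units(marker_xy, nose_xy, neutral_marker_xy, neutral_nose_xy, fapu):
    """Remove rigid head translation via the nose marker and express the
    remaining displacement relative to the neutral frame in FAPU units."""
    displacement = (np.asarray(marker_xy) - np.asarray(nose_xy)) \
                   - (np.asarray(neutral_marker_xy) - np.asarray(neutral_nose_xy))
    return displacement / fapu
```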

The direct conversion uses a modification of Davide Anguita's Matrix Backpropagation, which also enables real-time operation. The neural network used an 11-frame-long window on the input side (5 frames into the past and 5 frames into the future), and 4 principal component weights of the FAPs on the output. Each frame on the input is represented by a 16-band MFCC feature vector. The training set of the system contains standalone words and phonetically balanced sentences.
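A sketch of the input/output layout described above; the hidden layer size is an assumption, and scikit-learn's MLPRegressor merely stands in for the Matrix Backpropagation implementation actually used:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def context_windows(mfcc, past=5, future=5):
    """Stack 11-frame context windows (5 past + current + 5 future) of
    16-band MFCC vectors into one 176-dimensional input per frame."""
    padded = np.pad(mfcc, ((past, future), (0, 0)), mode="edge")
    return np.stack([padded[i:i + past + 1 + future].ravel()
                     for i in range(len(mfcc))])

# model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=1000)  # size is illustrative
# model.fit(context_windows(train_mfcc), train_face_pca[:, :4])  # 4 FacePCA weights
```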

In the ASR the speech signal was converted to a sampling frequency of 16 kHz. MFCC (Mel Frequency Cepstral Coefficient) based feature vectors were computed with delta and delta-delta components (39 dimensions in total). The recognition was performed on a batch of separated samples. Output annotations and the samples were joined, and the synchrony between the labels and the signal was checked manually.
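For illustration, a 39-dimensional feature layout of this kind (13 static MFCCs plus delta and delta-delta is the usual split, although the exact static count is not stated here) could be computed with librosa as a stand-in for the actual front end:

```python
import numpy as np
import librosa

def asr_features(wav_path):
    """13 MFCCs + delta + delta-delta = 39 dimensions per frame at 16 kHz."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T  # shape: (frames, 39)
```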

The visemes for the linear interpolation method were selected manually for each Hungarian viseme from the training set of the direct conversion. Visemes and phonemes were assigned by a table. Each segment is a linear interpolation from the actual viseme to the next one. Linear interpolation was calculated in the FAP representation.

The TTVS is a Visual Basic implemented system driven by a spreadsheet of the timed phonetic data. This spreadsheet was changed to the ASR output. Neighborhood-dependent dominance properties were calculated and viseme ratios were extracted. Linear interpolation, restrictions concerning biological boundaries, and median filtering were applied, in this order. The output is a Poser data file which is applied to a model. The texture of the model is modified to black skin and differently colored MPEG-4 FP location markers. The animation was rendered in draft mode, with the field of view and resolution of the original recording. Marker tracking was performed as described above, with the exception of the differently colored markers. FAPU values were measured in the rendered pixel space, and FAP values were calculated from the FAPU and the tracked marker positions.
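A compressed sketch of the last two post-processing steps (range restriction and median filtering) applied to a FAP trajectory; the bounds and kernel size are placeholders, not values from the thesis:

```python
import numpy as np
from scipy.signal import medfilt

def postprocess(fap_trajectory, lower, upper, kernel=5):
    """Clip each FAP track to plausible (biological) bounds, then median-filter,
    in the order described above. fap_trajectory has shape (frames, n_fap)."""
    clipped = np.clip(fap_trajectory, lower, upper)
    return np.apply_along_axis(medfilt, 0, clipped, kernel)
```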

This was done for both ASR runs, uninformed and informed.

The test material was manually segmented into 2-4 word units. The lengths of the units were around 3 seconds. The segmentation boundaries were listed and the video cut was done automatically with an Avisynth script. We used an MPEG-4 compatible head model renderer plugin for Avisynth, with the model "Alice" of the XFace project. The viewpoint and the field of view were adjusted to have only the mouth on the screen in frontal view.

During the test the subjects watched the videos full screen and used headphones.

3.2 Thesis

I. I showed that the direct AV mapping method, which is computationally more efficient than modular approaches, outperforms the modular AV mapping in terms of naturalness with a specific training set of a professional lip-speaker. [39]

3.2.1 Novelty

This is the first direct AV mapping system trained with data of a professional lip-speaker. The comparison to modular methods is interesting because direct AV mappings trained on low quality articulation can easily be outperformed by modular systems in terms of naturalness and intelligibility.

3.2.2 Measurements

Naturalness was measured as subjective similarity to human articulation. The measurement was blind and randomized, the number of test subjects was 58, and our direct AV mapping was not significantly worse than the original visual speech, but the difference between the modular mapping and the original was significant.

Opinion score averages and deviations showed no significant difference between human articulation and direct conversion, but a significant difference between human articulation and the modular mapping based systems.

The measurement was done on a Hungarian database of fluently read speech. The database contains mixed isolated words and sentences.

3.2.3 Limits of validity

Tests were done on a normal speech database, with the test subjects' attention fully focused on videos of good audio and video quality.

3.2.4 Consequences

Using direct conversion for areas where naturalness is most important is encouraged. Using a professional lip-speaker to record the audiovisual database increases the quality to be comparable with the level of human articulation. Other laboratories trained their systems with non-professionals, and those systems were not published due to their poor performance.



Figure 3.5: An example of the importance of correct timing. Frames of the word "Oktober" show timing differences between methods. Note that direct conversion received the best score even though it does not close the lips on the bilabial but closes on the velar, and it has problems with lip rounding.

Chapter 4

Temporal asymmetry

In this chapter I discuss the measurement of the relevant time window for direct AV mapping, which is important for building an audio-to-visual speech conversion system, since the temporal window of interest can be determined.

4.1 Method

The fine temporal structure of the relations between acoustic and visual features has been investigated to improve our speech-to-face conversion system. The mutual information of acoustic and visual features has been calculated with different time shifts. The results have shown that the movement of feature points on the face of professional lip-speakers can precede the changes of the acoustic parameters of the speech signal by as much as 100 ms. Considering this time variation, the quality of the speech-to-face animation conversion can be improved by using future speech sound for the conversion.

4.1.1 Introduction

Other research projects on the conversion of speech audio signal to facial animation have concentrated on the development of feature extraction methods, database construction and system training [40, 41]. Evaluation and comparison of different systems have also had high importance in the literature. In this chapter I discuss the temporal integration of acoustic features optimal for real-time conversion to facial animation. The critical part of such systems is the building of an optimal statistical model for the calculation of the video features from the audio features. There is currently no known exact relation between the audio feature set and the video feature set; this is still an open question.

The speech signal conveys information elements in a very specific way. Some speech sounds are related rather to a steady state of the articulatory organs, others rather to the transition movements [42]. Our target application is providing a communication aid for deaf people. Professional lip-speakers speak at a rate of 5-6 phonemes/s to adapt the communication to the demands of deaf people, so the steady state phases and the transition phases of speech sounds are longer than in an everyday speech style.



The signal features that characterize a steady state phase of a sound, a transition phase, or even a coarticulation phenomenon where the neighboring sounds are highly interrelated, need a careful selection of the temporal scope used to characterize the speech and video signal. In our model we selected 5 analysis windows to describe the actual frame of speech plus two previous and two succeeding windows, covering a +/-80 ms interval. Such a 5-element sequence of speech parameters can characterize transient sounds and coarticulation.

We have recognized that at the beginning of words the lip movements start earlier than the sound production. Sometimes the lips start to move to the initial position of the sounds 100 ms earlier. It was the task of the statistical model to handle this phenomenon.

In the refinement phase of our system we have tried to optimize the model by selecting the optimal temporal scope and fitting of audio and video features. The measure of the fitting was based on the mutual information of audio and video features [43].

The base system uses an adjustable temporal window of the audio speech signal. The neural network can be trained to respond to an array of MFCC windows, using future and/or past audio data. The conversion can only be as good as the amount of mutual information between the audio and video representations allows.

Using the trained neural net for calculation of the control parameters of the facial animation model

The audio processing unit extracts the audio MFCC feature vectors from the input speech signal. Five frames of MFCC vectors are used as input to the trained neural net. The NN provides FacePCA weight vectors. These are converted into the control parameters of an MPEG-4 standard face animation model. The test of the fitting of audio and video features was based on step-by-step temporal shifting of the feature vectors. The indicator of the matching was mutual information. Low mutual information means that we have a low average chance of estimating the facial parameters from the audio feature set. The time shift value that produces the highest mutual information gives the maximal average chance of calculating one kind of known features from the other.
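The FacePCA-to-FAP step is, as far as the description goes, the standard PCA reconstruction; a minimal sketch under that assumption:

```python
import numpy as np

def face_pca_to_fap(weights, components, mean_fap):
    """Reconstruct a FAP vector from predicted FacePCA weights.

    weights:    (4,) principal-component weights from the neural net.
    components: (4, n_fap) principal axes learned from the training recordings.
    mean_fap:   (n_fap,) mean FAP vector of the training data.
    """
    return mean_fap + weights @ components
```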

The estimation of mutual information needs a computationally intensive algorithm. The calculation is unrealistic on a large database with multidimensional feature vectors. So single MFCC PCA and FacePCA parameters were interrelated. Since the single parameters are orthogonal but not independent, they are not additive. For example, the FacePCA1 values are not independent from FacePCA2. The mutual information curves, even in such complex cases, can indicate the interrelations of the parameters.

An alternative method is to calculate cross-correlation. We have also tested this method. It needs less computational power, but some of the relations are not indicated, so it gives a lower estimate of the theoretical maximum.


Mutual information is high if knowing X helps to find out what Y is, and it is low if X and Y are independent. To use this measurement for the temporal scope, the audio signal is shifted in time relative to the video. If the time-shifted signal still has high mutual information, it means that this time value should be in the temporal scope. If the time shift is too high, the mutual information between the video and the time-shifted audio will be low due to the relative independence of different phonemes.

Using a and v as audio and video frames:

\[
\forall \Delta t \in [-1\,\mathrm{s}, 1\,\mathrm{s}]: \quad
MI(\Delta t) = \sum_{t=1}^{n} P(a_{t+\Delta t}, v_t)
\log \frac{P(a_{t+\Delta t}, v_t)}{P(a_{t+\Delta t})\, P(v_t)}
\tag{4.2}
\]

where P(x, y) is estimated by a 2-dimensional histogram convolved with a Gauss window. The Gauss window is needed to simulate the continuous space in the histogram in cases where only a few observations are present. Since the audio and video data are multidimensional and MI works with one-dimensional data, all the coefficient vectors were processed, and the results are summarized. The mutual information values have been estimated from 200x200 size joint distribution histograms. The histograms have been smoothed with a Gaussian window. The window has a 10-cell radius with 2.5 cell deviation. The marginal density distribution functions have been calculated from the sums of the joint distribution functions.

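A sketch of this estimator for one audio/video coefficient pair, following Eq. (4.2) and the histogram and smoothing settings above; the frame rate and the shift range in frames used in the usage comment are assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mutual_information(audio_coeff, video_coeff, shift, bins=200, sigma=2.5):
    """MI between one audio and one video coefficient, with the audio shifted
    by `shift` frames, using a smoothed 200x200 joint histogram (Eq. 4.2)."""
    shifted = np.roll(audio_coeff, -shift)                  # a_{t+dt}
    joint, _, _ = np.histogram2d(shifted, video_coeff, bins=bins)
    joint = gaussian_filter(joint, sigma=sigma)             # Gauss-window smoothing
    joint /= joint.sum()
    p_a = joint.sum(axis=1, keepdims=True)                  # marginal of audio
    p_v = joint.sum(axis=0, keepdims=True)                  # marginal of video
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (p_a @ p_v)[nz])))

# Sweep the shift over ±1 s (±25 frames at 25 fps) and sum over coefficient pairs:
# mi_curve = [sum(mutual_information(a, v, s) for a, v in pairs)
#             for s in range(-25, 26)]
```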