In this chapter I discuss the measurement of the naturalness of synthetic visual speech, and the comparison of different AV mapping approaches.
3.1 Method
A comparative study of audio-to-visual speech conversion is described in this chapter. The direct, feature-based conversion approach is compared to various indirect, ASR-based solutions. The previously detailed base system was used for direct conversion. The ASR-based solutions are the most sophisticated systems currently available for Hungarian. The methods are tested in the same environment in terms of audio pre-processing and facial motion visualization. Subjective opinion scores show that, with respect to naturalness, direct conversion performs well. Conversely, with respect to intelligibility, ASR-based systems perform better.
The thesis about the results of the comparison is important because no AV mapping comparisons had been done before with the novel training database of a professional lip-speaker.
3.1.1 Introduction
A difficulty that arises in comparing the different approaches is that they are usually developed and tested independently by the respective research groups. Different metrics are used, e.g. intelligibility tests and/or opinion scores, and different data and viewers are applied [28]. In this chapter I describe a comparative evaluation of different AV mapping approaches within the same workflow, see Figure 3.1. The performance of each is measured in terms of intelligibility, where lip-readability is measured, and naturalness, where a comparison with real visual speech is made.
3.1.2 Audio-to-visual Conversion
The performance of five different approaches will be evaluated. These are summarized as follows:
Figure 3.1: Multiple conversion methods were tested in the same environment.
Figure 3.2: Structure of direct conversion.

- A reference based on natural facial motion.
- A direct conversion system.
- An ASR-based system that linearly interpolates phonemic/visemic targets.
- An informed ASR-based approach that has access to the vocabulary of the test material (IASR).
- An uninformed ASR (UASR) that does not have access to the text vocabulary.

These are described in more detail in the following sections.
Direct conversion
We used our base system, with a database of a professional lip-speaker. The length of the recorded speech was 4250 frames.
ASR-basedconversion
For the ASR-based approaches a Weighted Finite State Transducer, Hidden Markov Model (WFST-HMM) decoder is used. Specifically, a system known as VOXerver [29] is used, which can run in one of two modes: informed, which exploits knowledge of the vocabulary of the test data, and uninformed, which does not. Incoming speech is converted to MFCCs, after which blind channel equalization is used to reduce linear distortion in the cepstral domain [30]. Speaker-independent, cross-word, decision-tree-based triphone acoustic models are applied, which were previously trained using the MRBA Hungarian speech database [31], a standardized, phonetically balanced Hungarian speech database developed at the Budapest University of Technology and Economics.
The uninformed ASR system uses a phoneme-bigram phonotactic model to constrain the decoding process. The phoneme-bigram probabilities were estimated from the MRBA database. In the informed ASR system a zerogram word language model is used with a vocabulary size of 120 words. Word pronunciations were determined automatically as described in [32].
In both types of speech recognition approaches the WFST-HMM recognition network was constructed offline using the AT&T FSM toolkit [33]. In the case of the informed system, phoneme labels were projected to the output of the transducer instead of word labels. The precision of the segmentation is 10 ms.
Viseme interpolation
To compare the direct and indirect audio-to-visual conversion systems, a standard approach for generating visual parameters is to first convert a phoneme to its equivalent viseme via a lookup table, then linearly interpolate the viseme targets. This approach to synthesizing facial motion is oversimplified because coarticulation effects are ignored, but it does provide a baseline on expected performance (a worst-case scenario).
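As a concrete illustration, this baseline can be summarized in a few lines of Python. This is a minimal sketch, not the actual implementation: the phoneme-to-viseme table, the FAP target vectors, and the function names are hypothetical placeholders.

```python
# Minimal sketch of the viseme-interpolation baseline. The phoneme-to-viseme
# table and FAP target values are illustrative placeholders, not the actual
# Hungarian tables used in the experiments.
import numpy as np

PHONEME_TO_VISEME = {"O": "open", "t": "dental", "v": "labiodental", "m": "bilabial"}

# Hypothetical viseme targets: one FAP vector (here 2-D: jaw opening, lip width).
VISEME_TARGETS = {
    "open":        np.array([120.0, 40.0]),
    "dental":      np.array([ 30.0, 55.0]),
    "labiodental": np.array([ 15.0, 50.0]),
    "bilabial":    np.array([  0.0, 45.0]),
}

def interpolate_visemes(segments, fps=25):
    """segments: list of (phoneme, start_time_s, end_time_s) from the ASR output.
    Returns one FAP vector per video frame, linearly interpolated between the
    viseme target of each segment (anchored at its midpoint) and the next."""
    anchors = [((s + e) / 2, VISEME_TARGETS[PHONEME_TO_VISEME[p]])
               for p, s, e in segments]
    t_end = segments[-1][2]
    frames = []
    for t in np.arange(0.0, t_end, 1.0 / fps):
        if t <= anchors[0][0]:
            frames.append(anchors[0][1])
        elif t >= anchors[-1][0]:
            frames.append(anchors[-1][1])
        else:
            for (t0, f0), (t1, f1) in zip(anchors, anchors[1:]):
                if t0 <= t <= t1:
                    w = (t - t0) / (t1 - t0)
                    frames.append((1 - w) * f0 + w * f1)
                    break
    return np.array(frames)
```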
Modular ATVS
To account for coarticulation effects, a more sophisticated interpolation scheme is required. In particular, the relative dominance of neighboring speech segments on the articulators is required. Speech segments can be classified as dominant, uncertain or mixed according to the level of influence exerted on the local neighborhood. To learn the dominance functions, an ellipsoid is fitted to the lips of speakers in a video sequence articulating Hungarian triphones. To aid the fitting, the speakers wear a distinctly colored lipstick. Dominance functions are estimated from the variance of visual data in a given phonetic neighborhood set. The learned dominance functions are used to interpolate between the visual targets derived from the ASR output [34]. We use the implementation of László Czap and János Mátyás here, which produces Poser script. FAPs are extracted from this format by the same workflow as from an original recording.
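The dominance idea can be sketched as follows. This is a minimal illustration of dominance-weighted target blending, assuming an exponential dominance shape; the actual system estimates its dominance functions from the variance of the lip data, so the shape and width used here are assumptions.

```python
# Minimal sketch of dominance-weighted target blending. The exponential
# dominance shape and its width parameter are illustrative assumptions.
import numpy as np

def dominance(t, center, width):
    """Hypothetical dominance function: exponential decay around the segment center."""
    return np.exp(-np.abs(t - center) / width)

def blend_targets(t, anchors, width=0.06):
    """anchors: list of (center_time_s, target_vector). The frame value is the
    dominance-weighted average of all viseme targets, so a weakly dominant
    segment lets its neighbors shape the articulation."""
    weights = np.array([dominance(t, c, width) for c, _ in anchors])
    targets = np.array([v for _, v in anchors])
    return weights @ targets / weights.sum()
```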
Figure 3.3: Modular ATVS consists of an ASR subsystem and a text-to-visual-speech subsystem.
Rendering Module
The visualization of the output of the ATVS methods is common to all approaches. The output from the ATVS modules are facial animation parameters (FAPs), which are applied to a common head model for all approaches. Note that although better facial descriptors than MPEG-4 are available, MPEG-4 is used here because our motion capture system does not provide more detail than this. The rendered video sequences are created from these FAP sequences using the Avisynth [35] 3D face renderer. As the main components of the framework are common between the different approaches, any differences are due to the differences in the AV mapping methods. Actual frames are shown in Figure 3.5.
3.1.3 Evaluation
Implementation-specific, non-critical behavior (e.g. articulation amplitude) should be normalized to ensure that the comparison is between the essential qualities of the methods. To discover these differences, a preliminary test was done.
Preliminary test
To tune the parameters of the systems, 7 videos were generated by each of the five mapping methods, and some sequences were re-synthesized from the original facial motion data. All sequences started and ended with a closed mouth, and each contained between 2 and 4 words. The speaker who participated in all of the tests was not one of those who were involved in the training of the audio-to-visual mapping. The videos were presented in a randomized order to 34 viewers, who were asked to rate the quality of the systems using an opinion score (1-5). The results are shown in Table 3.1.

Table 3.1: Results of preliminary tests used to tune the system parameters. Shown are the average and standard deviation of scores.

Method   Average score   STD
The results were unexpected: the IASR, which uses a more sophisticated coarticulation model, was expected to be one of the best performing systems. Closer investigation of the lower scores showed that the reason was the poorer audiovisual synchrony of IASR compared to UASR. The reason for this phenomenon is the difference between the mechanisms of the informed and the uninformed speech recognition process. During informed recognition the timing information is produced as a consequence of the alignment of the correct phonemes to the signal, which forces the segment boundaries to follow the given phonetic information. The uninformed recognition may miscategorize a phoneme, but the acoustical changes are the driver of the segment boundaries, so the resulting segmentation is closer to the acoustically reasonable one than the phonetically driven segmentation.
A qualitative difference between the direct and indirect approaches is the degree of mouth opening: the direct approach tended to open the mouth on average 30% more than the indirect approaches. Consequently, to bring the systems into the same dynamic range, the mouth opening for the direct mapping was damped by 30%. The synchrony of the ASR-based approaches was checked for systematic errors (constant or linearly increasing delays) using cross-correlation of locally time-shifted windows, but no systematic patterns of errors were detected.
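The synchrony check can be sketched as a windowed cross-correlation scan. The window length, hop, and maximum shift below are illustrative assumptions; the inputs are two aligned one-dimensional parameter trajectories such as jaw opening.

```python
# Minimal sketch of the synchrony check: estimate a local delay per window by
# the peak of the normalized cross-correlation, then inspect the profile.
import numpy as np

def local_delay(ref, test, start, win=50, max_shift=12):
    """Delay (in frames) of `test` against `ref` inside one window."""
    r = ref[start:start + win] - ref[start:start + win].mean()
    best_shift, best_corr = 0, -np.inf
    for s in range(-max_shift, max_shift + 1):
        t = test[start + s:start + s + win]
        t = t - t.mean()
        c = (r * t).sum() / (np.linalg.norm(r) * np.linalg.norm(t) + 1e-9)
        if c > best_corr:
            best_shift, best_corr = s, c
    return best_shift

def delay_profile(ref, test, win=50, hop=25, max_shift=12):
    """Delays over successive windows: a constant offset or a linear trend in
    this profile would indicate a systematic synchrony error."""
    starts = range(max_shift, len(ref) - win - max_shift, hop)
    return [local_delay(ref, test, s, win, max_shift) for s in starts]
```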
3.1.4 Results

ASR subsystem
The quality of the ASR-based approach is affected by the recognized phoneme string. This is typically 100% correct for the informed system, as the test set consists only of a small number of words (months of the year, days of the week, and numbers under 100), whilst the uninformed system has a typical error rate of 25.21%. Despite this, the ATVS using this input performs surprisingly well. The likely reason is the pattern of confusions: often phonemes that are confused acoustically appear visually similar on the lips. A second factor that affects the performance of the ASR-based approaches is the precision of the segmentation. Generally the uninformed systems are more precise on average than the informed systems. The precision of the segmentation can severely impact the subjective opinion scores. We therefore first attempt to quantify these likely sources of error.

Figure 3.4: Trajectory plot of the different methods for the word "Hatvanhárom" (hOtvOnha:rom). Jaw opening and lip opening width are shown. Note that the speaker did not pronounce the utterance perfectly, and the informed system attempts to force a match with the correctly recognized word. This leads to time alignment problems.
The informed recognition system is similar in nature to forced alignment in standard ASR tasks. For each utterance the recognizer is run in forced alignment mode for all of the vocabulary entries. The main difference between the informed and the uninformed recognition process is the different Markov state graphs used for recognition. The informed system is a zerogram without loopback, while the uninformed graph is a bigram model graph where the probabilities of the connections depend on language statistics.
While matching the extracted features with the Markovian states, the differences are accumulated in both scenarios. However, the uninformed system allows for different phonemes outside of the vocabulary to minimize the accumulated error. For the informed system only the most likely sequence is allowed, which can distort the segmentation; see Figure 3.4 for an example where the speaker mispronounces the word "Hatvanhárom" (hOtvOnha:rom, "63" in Hungarian). The (mis)segmentation of OtvO means the IASR ATVS system opens the mouth after the onset of the vowel. Human perception is sensitive to this error, and so it severely impacts the perceived quality. Without forcing the vocabulary, a system may ignore one of the consonants but open the mouth at the correct time.
Note that the generalization of this phenomenon is beyond the scope of this work. We have demonstrated that this is a problem with certain implementations of HMM-based ASR. Alternative, more robust implementations might alleviate these problems.
Table 3.2: Results of opinion scores, average and standard deviation.

Method                   Average score   STD
Original facial motion   3.73            1.01
Direct conversion        3.58            0.97
UASR                     3.43            1.08
Linear interpolation     2.73            1.12
IASR                     2.67            1.29
Subjective opinion scores
The test setup is similar to the preliminary test described previously to tune the system. However, 58 viewers were used, and only a quantitative opinion survey was made, on a scale of 1 (bad, very artificial) to 5 (real speech).
The result of the opinion score test is shown in Table 3.2. The advantage of direct conversion over UASR is on the edge of significance with p=0.0512, as is the difference between the original speech and the direct conversion with p=0.06, but UASR is significantly worse than original speech with p=0.00029. The results compared to the preliminary test also show that with respect to naturalness, the effect of the excessive articulation is not significant. The advantage of correct timing over a correct phoneme string is also significant.
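For reproducibility, pairwise significance values of this kind can be computed from the raw per-viewer scores. The thesis does not state which test was used; the sketch below assumes an unpaired two-sample t-test and uses synthetic placeholder scores drawn from the reported means and deviations, purely for illustration.

```python
# Illustrative significance computation; the score arrays here are synthetic
# placeholders, not the actual ratings of the 58 viewers.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores_direct = rng.normal(3.58, 0.97, 58)  # placeholder for direct-conversion ratings
scores_uasr = rng.normal(3.43, 1.08, 58)    # placeholder for UASR ratings

t, p = stats.ttest_ind(scores_direct, scores_uasr)
print(f"t = {t:.3f}, p = {p:.4f}")
```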
Note that the linear interpolation system exploits better quality ASR results, but still performs significantly worse than the average of the other ASR-based approaches. This shows the importance of correctly handling viseme dominance and viseme neighborhood sensitivity in ASR-based ATVS systems.
Intelligibility
Intelligibility was measured with a test of recognition of video sequences without sound. This is not the popular Modified Rhyme Test [36], but for our purposes, with hearing-impaired viewers, it is more relevant, since keyword spotting is the most common lip-reading task. The 58 test subjects had to guess which word was said from a given set of 5 words of the same category. The categories were numbers, names of months, and the days of the week. All the words were said twice. The sets were intervals, to eliminate the memory test from the task (for example "2", "3", "4", "5", "6" can be a set). This task models the situation of hearing-impaired people, or a very noisy environment, where an ATVS system can be used. It is assumed that the context is known, so keyword spotting is the closest task to the problem.
The performance of the audio-to-visual speech conversion methods reverses in this task compared to naturalness. The main result here is the dominance of the ASR-based approaches (Table 3.3), and the insignificance of the difference between informed and uninformed ATVS results (p=0.43) in this test, which may deserve further investigation. Note that as synchrony is not an issue without voice, the IASR is the best.
Table 3.3: Results of recognition tests, average and standard deviation of success rate in percent. A random pick would give 20%.

Method              Precision   STD
IASR                61%         20%
UASR                57%         22%
Original motion     53%         18%
Cartoon             44%         11%
Direct conversion   36%         27%
Table 3.4: Comparison to the results of Öhman and Salvi [27], an intelligibility test of HMM- and rule-based systems. The intelligibility of corresponding methods is similar.

Methods (ours/[27])   Prec. (ours)   Prec. ([27])
IASR/Ideal            61%            64%
UASR/HMM              57%            54%
Direct/ANN            36%            34%
As a comparison with [27], where intelligibility is tested similarly: their manually tuned, optimal rule-based facial parameters are close to our IASR, since there was no recognition error, without voice the time alignment quality is not important, and our TTVS is rule-based. Their HMM test is similar to our UASR, because both are without vocabulary, both target a time-aligned phoneme string to be converted to facial parameters, and our ASR is HMM-based. Their ANN system is very close to our direct conversion except for the training set: theirs is audio from a standard speech database and video of a rule-based calculated trajectory, while our system is trained on an actual recording of a professional lip-speaker. However, the results concerning intelligibility are close to each other, see Table 3.4. This is a validation of the results, since the corresponding measurements are close to each other. It is important to note that [27] tests only intelligibility, and only counterparts of three of our methods, so our measurement is broader.
3.1.5 Conclusion
I presented a comparative study of audio-to-visual speech conversion methods. We have presented a comparison of our direct conversion system with conceptually different conversion solutions. A subset of the results correlates with already published results, validating the approach of the comparison.
We observe a higher importance of synchrony over phoneme precision in an ASR-based ATVS system. There are publications on the high impact of correct timing in different aspects [34, 37, 38], but our results show explicitly that more accurate timing achieves a much better subjective evaluation than a more accurate phoneme sequence. Also, we have shown that in the aspect of subjective naturalness evaluation, direct conversion (trained on professional lip-speaker articulation) is a method which produces the highest opinion score, 95.9% of that of an original facial motion recording, with lower computational complexity than ASR-based solutions.
For tasks where intelligibility is important (support for the hearing impaired, visual information in noisy environments), modular ATVS is the best approach among those presented. Our mission of aiding hearing-impaired people calls upon us to consider using ASR-based components. For naturalness (animation, entertainment applications), direct conversion is a good choice. For both aspects, UASR gives relatively good but not outstanding results.
3.1.6 Technical details
Marker tracking was done for MPEG-4 FPs 8.8, 8.4, 8.6, 8.1, 8.5, 8.3, 8.7, 8.2, 5.2, 9.2, 9.3, 9.1, 5.1, 2.10, 2.1. During synthesis, all FAPs (MPEG-4 Facial Animation Parameters) connected to these FPs were used, except depth information:

• raise_r_cornerlip
• lower_t_midlip_o
• raise_b_midlip_o
• raise_r_cornerlip_o

The inner lip contour is estimated from the outer markers.
Yellow paint was used to mark the FP locations on the face of the recorded lip-speaker. The video recording is 576i PAL (576x720 pixels, 25 frames/s, 24 bits/pixel). The audio recording is mono, 48 kHz, 16 bit, made in a silent room. Further conversions depended on the actual method.
Marker tracking was based on color matching and intensity localization from frame to frame, and the location was identified by the region. In overlapping regions the closest location on the previous frame was used to identify the marker. A frame with a neutral face was selected to use as the reference for FAPU measurement. The marker on the nose is used as a reference to eliminate head motion.
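A minimal sketch of this tracking scheme is given below. The marker color, tolerance value, and function names are assumptions; the real tracker also exploits intensity localization and region identity.

```python
# Minimal sketch of color-based marker tracking: threshold pixels near the
# marker color, cluster them into candidate regions, and resolve overlaps by
# choosing the candidate closest to the marker's previous-frame position.
import numpy as np
from scipy import ndimage

def track_marker(frame_rgb, prev_pos, target_rgb=(220, 200, 40), tol=60):
    """frame_rgb: HxWx3 uint8 image; prev_pos: (row, col) from previous frame.
    Returns the centroid of the candidate region nearest to prev_pos."""
    dist = np.linalg.norm(frame_rgb.astype(float) - np.array(target_rgb), axis=2)
    mask = dist < tol                      # pixels matching the marker color
    labels, n = ndimage.label(mask)        # connected candidate regions
    if n == 0:
        return prev_pos                    # lost marker: keep previous position
    centroids = ndimage.center_of_mass(mask, labels, range(1, n + 1))
    d = [np.hypot(r - prev_pos[0], c - prev_pos[1]) for r, c in centroids]
    return centroids[int(np.argmin(d))]
```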
The direct conversion uses a modification of Davide Anguita's Matrix Backpropagation, which also enables real-time operation. The neural network used an 11-frame-long window on the input side (5 frames into the past and 5 frames into the future), and 4 principal component weights of FAP on the output. Each frame of the input is represented by a 16-band MFCC feature vector. The training set of the system contains standalone words and phonetically balanced sentences.
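The data layout of the direct mapping can be sketched as follows; the network itself is a Matrix Backpropagation variant, so the windowing code below only illustrates the input/output shapes stated above.

```python
# Sketch of the direct mapping's data layout, assuming a plain feed-forward
# network. Shapes follow the text: 11 frames x 16 MFCC bands in, 4 FAP-PCA
# weights out.
import numpy as np

def make_windows(mfcc, past=5, future=5):
    """mfcc: (T, 16) array of per-frame MFCC vectors. Returns (T', 11*16)
    inputs, one flattened 11-frame context window per predictable frame."""
    T = mfcc.shape[0]
    rows = [mfcc[t - past:t + future + 1].ravel()
            for t in range(past, T - future)]
    return np.array(rows)

# Example: X has shape (T-10, 176); a trained network maps each row to a
# 4-dimensional FAP-PCA weight vector, which is then projected back to FAPs.
X = make_windows(np.random.randn(4250, 16))
```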
In the ASR, the speech signal was converted to a sampling frequency of 16 kHz. MFCC (Mel Frequency Cepstral Coefficient)-based feature vectors were computed with delta and delta-delta components (39 dimensions in total). The recognition was performed on a batch of separated samples. Output annotations and the samples were joined, and the synchrony between labels and the signal was checked manually.
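A sketch of this 39-dimensional front end using librosa is shown below. The actual feature extraction is part of the VOXerver system, so the frame and filterbank settings here are assumptions.

```python
# Illustrative 39-dimensional MFCC + delta + delta-delta front end.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)     # resample to 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 static coefficients
d1 = librosa.feature.delta(mfcc)                    # delta
d2 = librosa.feature.delta(mfcc, order=2)           # delta-delta
features = np.vstack([mfcc, d1, d2])                # (39, n_frames)
```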
The visemes for the linear interpolation method were selected manually, for each viseme in Hungarian, from the training set of the direct conversion. Visemes and phonemes were assigned by a table. Each segment is a linear interpolation from the actual viseme to the next one. The linear interpolation was calculated in the FAP representation.
The TTVS is a system implemented in Visual Basic with a spreadsheet of the timed phonetic data. This spreadsheet was changed to the ASR output. Neighborhood-dependent dominance properties were calculated and viseme ratios were extracted. Linear interpolation, restrictions concerning biological boundaries, and median filtering were applied, in this order. The output is a Poser data file which is applied to a model. The texture of the model is modified to black skin and differently colored MPEG-4 FP location markers. The animation was rendered in draft mode, with the field of view and resolution of the original recording. Marker tracking was performed as described above, with the exception of the differently colored markers. FAPU values were measured in the rendered pixel space, and FAP values were calculated from the FAPU and tracked marker positions.
This was done for both ASR runs, uninformed and informed.
The test material was manually segmented into 2-4 word units. The lengths of the units were around 3 seconds. The segmentation boundaries were listed and the video cutting was done automatically with an Avisynth script. We used an MPEG-4 compatible head model renderer plugin for Avisynth, with the model "Alice" of the XFace project. The viewpoint and the field of view were adjusted to have only the mouth on the screen, in frontal view.
During the test the subjects watched the videos full screen and used headphones.
3.2 Thesis
I. I showed that the direct AV mapping method, which is computationally more efficient than modular approaches, outperforms modular AV mapping in the aspect of naturalness with a specific training set of a professional lip-speaker. [39]
3.2.1 Novelty
This is the first direct AV mapping system trained with data of a professional lip-speaker. The comparison to modular methods is interesting because direct AV mappings trained on low-quality articulation can easily be outperformed by modular systems in the aspects of naturalness and intelligibility.
3.2.2 Measurements
Naturalness was measured as subjective similarity to human articulation. The measurement was blind and randomized, the number of test subjects was 58, and our direct AV mapping was not significantly worse than original visual speech, while the difference between the modular systems and the original was significant.
Opinion score averages and deviations showed no significant difference between human articulation and direct conversion, but a significant difference between human articulation and modular mapping based systems.
The measurement was done on a Hungarian database of fluently read speech. The database contains mixed isolated words and sentences.
3.2.3 Limits of validity
Tests were done on a normal speech database, with the perception of the test subjects fully focused on videos of good audio and video quality.
3.2.4 Consequences
Using direct conversion for areas where naturalness is most important is encouraged. Using a professional lip-speaker to record the audiovisual database increases the quality to be comparable with the level of human articulation. Other laboratories trained their systems with non-professionals, and those systems were not published due to their poor performance.
Figure 3.5: An example of the importance of correct timing. Frames of the word "Oktober" show timing differences between methods. Note that direct conversion received the best score even though it does not close the lips on the bilabial but closes on the velar, and it has problems with lip rounding.
Chapter 4

Temporal asymmetry
In this chapter I discuss the measurement of the relevant time window for direct AV mapping, which is important for building an audio-to-visual speech conversion system, since the temporal window of interest can be determined.
4.1 Method
The fine temporal structure of the relations of acoustic and visual features has been investigated to improve our speech-to-face conversion system. The mutual information of acoustic and visual features has been calculated with different time shifts. The result has shown that the movement of feature points on the face of professional lip-speakers can precede the changes of the acoustic parameters of the speech signal by as much as 100 ms. Considering this time variation, the quality of speech-to-face-animation conversion can be improved by using future speech sound in the conversion.
4.1.1 Introduction
Other research projects on the conversion of speech audio signals to facial animation have concentrated on the development of feature extraction methods, database construction and system training [40, 41]. Evaluation and comparison of different systems have also had high importance in the literature. In this chapter I discuss the temporal integration of acoustic features optimal for real-time conversion to facial animation. The critical part of such systems is the building of an optimal statistical model for the calculation of the video features from the audio features. There is currently no known exact relation between the audio feature set and the video feature set; this is still an open question.
The speech signal conveys information elements in a very specific way. Some speech sounds are related rather to a steady state of the articulatory organs, others rather to the transition movements [42]. Our target application is to provide a communication aid for deaf people. Professional lip-speakers speak at a rate of 5-6 phonemes/s to adapt the communication to the demands of deaf people, so the steady-state phases and the transition phases of speech sounds are longer than in an everyday speech style.
The signal features that characterize the steady-state phase of a sound, a transition phase, or even a coarticulation phenomenon when the neighboring sounds are highly interrelated, need a careful selection of the temporal scope used to characterize the speech and video signals. In our model we selected 5 analysis windows: the window describing the actual frame of speech plus two preceding and two succeeding windows, covering a +/-80 ms interval. Such a 5-element sequence of speech parameters can characterize transient sounds and coarticulations.
We have recognized that at the beginning of words the lip movements start earlier than the sound production. Sometimes the lips start to move to the initial position of the sounds 100 ms earlier. It was the task of the statistical model to handle this phenomenon.
In the refinement phase of our system we have tried to optimize the model by selecting the optimal temporal scope and fitting of audio and video features. The measure of the fitting was based on the mutual information of audio and video features [43].
The base system uses an adjustable temporal window of the audio speech signal. The neural network can be trained to respond to an array of MFCC windows, using future and/or past audio data. The conversion can only be as good as the amount of mutual information between the audio and video representations allows.
Using the trained neural net for calculation of control parameters of the facial animation model
The audio processing unit extracts the audio MFCC feature vectors from the input speech signal. Five frames of MFCC vectors are used as input to the trained neural net. The NN provides FacePCA weight vectors. These are converted into the control parameters of an MPEG-4 standard face animation model. The test of the fitting of audio and video features was based on step-by-step temporal shifting of the feature vectors. The indicator of the matching was mutual information. Low mutual information means that we have a low average chance to estimate the facial parameters from the audio feature set. The time shift value producing the highest mutual information gives the maximal average chance to calculate one kind of known features from the other.
The estimation of mutual information needs a computation-intensive algorithm. The calculation is unrealistic using a large database with multidimensional feature vectors, so single MFCC-PCA and FacePCA parameters were interrelated. Since the single parameters are orthogonal but not independent, they are not additive. For example, the FacePCA1 values are not independent from FacePCA2. The mutual information curves even in such complex cases can indicate the interrelations of parameters.
An alternative method is to calculate cross-correlation. We have also tested this method. It needs less computational power, but some relations are not indicated, so it is a lower estimate of the theoretical maximum.
Mutual information is high if knowing X helps to find out what Y is, and it is low if X and Y are independent. To use this measurement for the temporal scope, the audio signal will be shifted in time compared to the video. If the time-shifted signal still has high mutual information, it means that this time value should be in the temporal scope. If the time shift is too high, the mutual information between the video and the time-shifted audio will be low due to the relative independence of different phonemes.
Using a and v as audio and video frames:

\[
MI(\Delta t) = \sum_{t=1}^{n} P(a_{t+\Delta t}, v_t)\,\log\frac{P(a_{t+\Delta t}, v_t)}{P(a_{t+\Delta t})\,P(v_t)}, \qquad \forall\,\Delta t \in [-1\,\mathrm{s}, 1\,\mathrm{s}] \tag{4.2}
\]

where P(x, y) is estimated by a 2-dimensional histogram convolved with a Gaussian window. The Gaussian window is needed to simulate the continuous space in the histogram in cases where only a few observations are available. Since the audio and video data are multidimensional and MI works with one-dimensional data, all the coefficient vectors were processed and the results were summarized. The mutual information values have been estimated from 200x200-size joint distribution histograms. The histograms have been smoothed by a Gaussian window. The window has a 10-cell radius with a 2.5-cell deviation. The marginal density distribution functions have been calculated from the sums of the joint distribution functions.
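The estimate described above can be sketched directly from the definition: build the 200x200 joint histogram, smooth it with a Gaussian window, derive the marginals from the smoothed joint distribution, and scan the time shift. The sigma value (taken as the stated 2.5-cell deviation) and the function names below are illustrative.

```python
# Minimal sketch of the histogram-based MI estimate and the time-shift scan.
# a, v are single coefficient trajectories (one MFCC-PCA and one FacePCA
# component), sampled at the video frame rate.
import numpy as np
from scipy.ndimage import gaussian_filter

def mutual_information(a, v, bins=200, sigma=2.5):
    """Smoothed-histogram MI estimate for two equal-length 1-D trajectories."""
    joint, _, _ = np.histogram2d(a, v, bins=bins)
    joint = gaussian_filter(joint, sigma=sigma)   # simulate continuous space
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)         # marginal of a
    pv = joint.sum(axis=0, keepdims=True)         # marginal of v
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pv)[nz])).sum())

def mi_curve(a, v, fps=25, max_shift_s=1.0):
    """MI as a function of the audio-vs-video time shift, in seconds."""
    m = int(max_shift_s * fps)
    return [(s / fps,
             mutual_information(a[max(s, 0):len(a) + min(s, 0)],
                                v[max(-s, 0):len(v) + min(-s, 0)]))
            for s in range(-m, m + 1)]
```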