In this chapter I discuss the measurement of the naturalness of synthetic visual speech, and the comparison of different AV mapping approaches.
3.1 Method
A comparative study of audio-to-visual speech conversion is described in this chapter. The direct, feature-based conversion approach is compared to various indirect, ASR-based solutions. The previously detailed base system was used for direct conversion. The ASR-based solutions are the most sophisticated systems currently available for Hungarian. The methods are tested in the same environment in terms of audio pre-processing and facial motion visualization. Subjective opinion scores show that, with respect to naturalness, direct conversion performs well. Conversely, with respect to intelligibility, ASR-based systems perform better.
The thesis about the results of the comparison is important because no AV mapping comparisons had been done before with the novel training database of a professional lip-speaker.
3.1.1 Introduction
A difficulty that arises in comparing the different approaches is that they are usually developed and tested independently by the respective research groups. Different metrics are used, e.g. intelligibility tests and/or opinion scores, and different data and viewers are applied [28]. In this chapter I describe a comparative evaluation of different AV mapping approaches within the same workflow, see Figure 3.1. The performance of each is measured in terms of intelligibility, where lip-readability is measured, and naturalness, where a comparison with real visual speech is made.
3.1.2 Audio-to-visual Conversion
The performance of five different approaches will be evaluated. These are summarized as follows:
Figure 3.1: Multiple conversion methods were tested in the same environment.
Figure 3.2: Structure of direct conversion.

- A reference based on natural facial motion.
- A direct conversion system.
- An ASR-based system that linearly interpolates phonemic/visemic targets.
- An informed ASR-based approach that has access to the vocabulary of the test material (IASR).
- An uninformed ASR (UASR) that does not have access to the text vocabulary.

These are described in more detail in the following sections.
Direct conversion
We used our base system, with a database of a professional lip-speaker. The length of the recorded speech was 4250 frames.
ASR-basedconversion
For the ASR-based approaches a Weighted Finite State Transducer, Hidden Markov Model (WFST-HMM) decoder is used. Specifically, a system known as VOXerver [29] is used, which can run in one of two modes: informed, which exploits knowledge of the vocabulary of the test data, and uninformed, which does not. Incoming speech is converted to MFCCs, after which blind channel equalization is used to reduce linear distortion in the cepstral domain [30]. Speaker-independent, cross-word, decision-tree-based triphone acoustic models are applied, which were previously trained using the MRBA Hungarian speech database [31], a standardized, phonetically balanced Hungarian speech database developed at the Budapest University of Technology and Economics.
The uninformed ASR system uses a phoneme-bigram phonotactic model to constrain the decoding process. The phoneme-bigram probabilities were estimated from the MRBA database. In the informed ASR system a zerogram word language model is used with a vocabulary size of 120 words. Word pronunciations were determined automatically as described in [32].
In both types of speech recognition approaches the WFST-HMM recognition network was constructed offline using the AT&T FSM toolkit [33]. In the case of the informed system, phoneme labels were projected to the output of the transducer instead of word labels. The precision of the segmentation is 10 ms.
Viseme interpolation
To compare the direct and indirect audio-to-visual conversion systems, a standard approach for generating visual parameters is to first convert a phoneme to its equivalent viseme via a lookup table, then linearly interpolate the viseme targets. This approach to synthesizing facial motion is oversimplified because coarticulation effects are ignored, but it does provide a baseline on expected performance (a worst-case scenario).
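As a concrete illustration, this baseline can be summarized in a few lines of Python. This is a minimal sketch, not the actual implementation: the phoneme-to-viseme table, the FAP target vectors, and the function names are hypothetical placeholders.

```python
# Minimal sketch of the viseme-interpolation baseline. The phoneme-to-viseme
# table and FAP target values are illustrative placeholders, not the actual
# Hungarian tables used in the experiments.
import numpy as np

PHONEME_TO_VISEME = {"O": "open", "t": "dental", "v": "labiodental", "m": "bilabial"}

# Hypothetical viseme targets: one FAP vector (here 2-D: jaw opening, lip width).
VISEME_TARGETS = {
    "open":        np.array([120.0, 40.0]),
    "dental":      np.array([ 30.0, 55.0]),
    "labiodental": np.array([ 15.0, 50.0]),
    "bilabial":    np.array([  0.0, 45.0]),
}

def interpolate_visemes(segments, fps=25):
    """segments: list of (phoneme, start_time_s, end_time_s) from the ASR output.
    Returns one FAP vector per video frame, linearly interpolated between the
    viseme target of each segment (anchored at its midpoint) and the next."""
    anchors = [((s + e) / 2, VISEME_TARGETS[PHONEME_TO_VISEME[p]])
               for p, s, e in segments]
    t_end = segments[-1][2]
    frames = []
    for t in np.arange(0.0, t_end, 1.0 / fps):
        if t <= anchors[0][0]:
            frames.append(anchors[0][1])
        elif t >= anchors[-1][0]:
            frames.append(anchors[-1][1])
        else:
            for (t0, f0), (t1, f1) in zip(anchors, anchors[1:]):
                if t0 <= t <= t1:
                    w = (t - t0) / (t1 - t0)
                    frames.append((1 - w) * f0 + w * f1)
                    break
    return np.array(frames)
```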
Modular ATVS
To account for coarticulation effects, a more sophisticated interpolation scheme is required. In particular, the relative dominance of neighboring speech segments on the articulators is required. Speech segments can be classified as dominant, uncertain or mixed according to the level of influence exerted on the local neighborhood. To learn the dominance functions, an ellipsoid is fitted to the lips of speakers in a video sequence articulating Hungarian triphones. To aid the fitting, the speakers wear a distinctly colored lipstick. Dominance functions are estimated from the variance of visual data in a given phonetic neighborhood set. The learned dominance functions are used to interpolate between the visual targets derived from the ASR output [34]. We use the implementation of László Czap and János Mátyás here, which produces Poser script. FAPs are extracted from this format by the same workflow as from an original recording.
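The dominance idea can be sketched as follows. This is a minimal illustration of dominance-weighted target blending, assuming an exponential dominance shape; the actual system estimates its dominance functions from the variance of the lip data, so the shape and width used here are assumptions.

```python
# Minimal sketch of dominance-weighted target blending. The exponential
# dominance shape and its width parameter are illustrative assumptions.
import numpy as np

def dominance(t, center, width):
    """Hypothetical dominance function: exponential decay around the segment center."""
    return np.exp(-np.abs(t - center) / width)

def blend_targets(t, anchors, width=0.06):
    """anchors: list of (center_time_s, target_vector). The frame value is the
    dominance-weighted average of all viseme targets, so a weakly dominant
    segment lets its neighbors shape the articulation."""
    weights = np.array([dominance(t, c, width) for c, _ in anchors])
    targets = np.array([v for _, v in anchors])
    return weights @ targets / weights.sum()
```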
Figure 3.3: Modular ATVS consists of an ASR subsystem and a text-to-visual-speech subsystem.
Rendering Module
The visualization of the output of the ATVS methods is common to all approaches. The output from the ATVS modules are facial animation parameters (FAPs), which are applied to a common head model for all approaches. Note that although better facial descriptors than MPEG-4 are available, MPEG-4 is used here because our motion capture system does not provide more detail than this. The rendered video sequences are created from these FAP sequences using the Avisynth [35] 3D face renderer. As the main components of the framework are common between the different approaches, any differences are due to the differences in the AV mapping methods. Actual frames are shown in Figure 3.5.
3.1.3 Evaluation
Implementation-specific, non-critical behavior (e.g. articulation amplitude) should be normalized to ensure that the comparison is between the essential qualities of the methods. To discover these differences, a preliminary test was done.
Preliminary test
To tune the parameters of the systems, 7 videos were generated by each of the five mapping methods, and some sequences were re-synthesized from the original facial motion data. All sequences started and ended with a closed mouth, and each contained between 2 and 4 words. The speaker who participated in all of the tests was not one of those who were involved in the training of the audio-to-visual mapping. The videos were presented in a randomized order to 34 viewers, who were asked to rate the quality of the systems using an opinion score (1-5). The results are shown in Table 3.1.

Table 3.1: Results of preliminary tests used to tune the system parameters. Shown are the average and standard deviation of scores.

Method   Average score   STD
The results were unexpected: the IASR, which uses a more sophisticated coarticulation model, was expected to be one of the best performing systems. Closer investigation of the lower scores showed that the reason was the poorer audiovisual synchrony of IASR compared to UASR. The reason for this phenomenon is the difference between the mechanisms of the informed and the uninformed speech recognition process. During informed recognition the timing information is produced as a consequence of the alignment of the correct phonemes to the signal, which forces the segment boundaries to follow the given phonetic information. The uninformed recognition may miscategorize a phoneme, but the acoustical changes are the driver of the segment boundaries, so the resulting segmentation is closer to the acoustically reasonable one than the phonetically driven segmentation.
A qualitative difference between the direct and indirect approaches is the degree of mouth opening: the direct approach tended to open the mouth on average 30% more than the indirect approaches. Consequently, to bring the systems into the same dynamic range, the mouth opening for the direct mapping was damped by 30%. The synchrony of the ASR-based approaches was checked for systematic errors (constant or linearly increasing delays) using cross-correlation of locally time-shifted windows, but no systematic patterns of errors were detected.
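The synchrony check can be sketched as a windowed cross-correlation scan. The window length, hop, and maximum shift below are illustrative assumptions; the inputs are two aligned one-dimensional parameter trajectories such as jaw opening.

```python
# Minimal sketch of the synchrony check: estimate a local delay per window by
# the peak of the normalized cross-correlation, then inspect the profile.
import numpy as np

def local_delay(ref, test, start, win=50, max_shift=12):
    """Delay (in frames) of `test` against `ref` inside one window."""
    r = ref[start:start + win] - ref[start:start + win].mean()
    best_shift, best_corr = 0, -np.inf
    for s in range(-max_shift, max_shift + 1):
        t = test[start + s:start + s + win]
        t = t - t.mean()
        c = (r * t).sum() / (np.linalg.norm(r) * np.linalg.norm(t) + 1e-9)
        if c > best_corr:
            best_shift, best_corr = s, c
    return best_shift

def delay_profile(ref, test, win=50, hop=25, max_shift=12):
    """Delays over successive windows: a constant offset or a linear trend in
    this profile would indicate a systematic synchrony error."""
    starts = range(max_shift, len(ref) - win - max_shift, hop)
    return [local_delay(ref, test, s, win, max_shift) for s in starts]
```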
3.1.4 Results

ASR subsystem
The quality of the ASR-based approach is affected by the recognized phoneme string. This is typically 100% correct for the informed system, as the test set consists only of a small number of words (months of the year, days of the week, and numbers under 100), whilst the uninformed system has a typical error rate of 25.21%. Despite this, the ATVS using this input performs surprisingly well. The likely reason is the pattern of confusions: often phonemes that are confused acoustically appear visually similar on the lips. A second factor that affects the performance of the ASR-based approaches is the precision of the segmentation. Generally the uninformed systems are more precise on average than the informed systems. The precision of the segmentation can severely impact the subjective opinion scores. We therefore first attempt to quantify these likely sources of error.

Figure 3.4: Trajectory plot of the different methods for the word "Hatvanhárom" (hOtvOnha:rom). Jaw opening and lip opening width are shown. Note that the speaker did not pronounce the utterance perfectly, and the informed system attempts to force a match with the correctly recognized word. This leads to time alignment problems.
The informed recognition system is similar in nature to forced alignment in standard ASR tasks. For each utterance the recognizer is run in forced alignment mode for all of the vocabulary entries. The main difference between the informed and the uninformed recognition process is the different Markov state graphs used for recognition. The informed system is a zerogram without loopback, while the uninformed graph is a bigram model graph where the probabilities of the connections depend on language statistics.
While matching the extracted features with the Markovian states, the differences are accumulated in both scenarios. However, the uninformed system allows for different phonemes outside of the vocabulary to minimize the accumulated error. For the informed system only the most likely sequence is allowed, which can distort the segmentation; see Figure 3.4 for an example where the speaker mispronounces the word "Hatvanhárom" (hOtvOnha:rom, "63" in Hungarian). The (mis)segmentation of OtvO means the IASR ATVS system opens the mouth after the onset of the vowel. Human perception is sensitive to this error, and so it severely impacts the perceived quality. Without forcing the vocabulary, a system may ignore one of the consonants but open the mouth at the correct time.
Note that the generalization of this phenomenon is beyond the scope of this work. We have demonstrated that this is a problem with certain implementations of HMM-based ASR. Alternative, more robust implementations might alleviate these problems.
Table 3.2: Results of opinion scores, average and standard deviation.

Method                   Average score   STD
Original facial motion   3.73            1.01
Direct conversion        3.58            0.97
UASR                     3.43            1.08
Linear interpolation     2.73            1.12
IASR                     2.67            1.29
Subjective opinion scores
The test setup is similar to the preliminary test described previously to tune the system. However, 58 viewers were used, and only a quantitative opinion survey was made, on a scale of 1 (bad, very artificial) to 5 (real speech).
The result of the opinion score test is shown in Table 3.2. The advantage of direct conversion over UASR is on the edge of significance with p=0.0512, as is the difference between the original speech and the direct conversion with p=0.06, but UASR is significantly worse than original speech with p=0.00029. The results compared to the preliminary test also show that with respect to naturalness, the effect of the excessive articulation is not significant. The advantage of correct timing over a correct phoneme string is also significant.
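For reproducibility, pairwise significance values of this kind can be computed from the raw per-viewer scores. The thesis does not state which test was used; the sketch below assumes an unpaired two-sample t-test and uses synthetic placeholder scores drawn from the reported means and deviations, purely for illustration.

```python
# Illustrative significance computation; the score arrays here are synthetic
# placeholders, not the actual ratings of the 58 viewers.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores_direct = rng.normal(3.58, 0.97, 58)  # placeholder for direct-conversion ratings
scores_uasr = rng.normal(3.43, 1.08, 58)    # placeholder for UASR ratings

t, p = stats.ttest_ind(scores_direct, scores_uasr)
print(f"t = {t:.3f}, p = {p:.4f}")
```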
Note that the linear interpolation system exploits better quality ASR results, but still performs significantly worse than the average of the other ASR-based approaches. This shows the importance of correctly handling viseme dominance and viseme neighborhood sensitivity in ASR-based ATVS systems.
Intelligibility
Intelligibility was measured with a test of recognition of video sequences without sound. This is not the popular Modified Rhyme Test [36], but for our purposes, with hearing-impaired viewers, it is more relevant, since keyword spotting is the most common lip-reading task. The 58 test subjects had to guess which word was said from a given set of 5 words of the same category. The categories were numbers, names of months, and the days of the week. All the words were said twice. The sets were intervals, to eliminate the memory test from the task (for example "2", "3", "4", "5", "6" can be a set). This task models the situation of hearing-impaired people, or a very noisy environment, where an ATVS system can be used. It is assumed that the context is known, so keyword spotting is the closest task to the problem.
The performance of the audio-to-visual speech conversion methods reverses in this task compared to naturalness. The main result here is the dominance of the ASR-based approaches (Table 3.3), and the insignificance of the difference between informed and uninformed ATVS results (p=0.43) in this test, which may deserve further investigation. Note that as synchrony is not an issue without voice, the IASR is the best.
Table 3.3: Results of recognition tests, average and standard deviation of success rate in percent. A random pick would give 20%.

Method              Precision   STD
IASR                61%         20%
UASR                57%         22%
Original motion     53%         18%
Cartoon             44%         11%
Direct conversion   36%         27%
Table 3.4: Comparison to the results of Öhman and Salvi [27], an intelligibility test of HMM- and rule-based systems. The intelligibility of corresponding methods is similar.

Methods (ours/[27])   Prec. (ours)   Prec. ([27])
IASR/Ideal            61%            64%
UASR/HMM              57%            54%
Direct/ANN            36%            34%
As a comparison with [27], where intelligibility is tested similarly: their manually tuned, optimal rule-based facial parameters are close to our IASR, since there was no recognition error, without voice the time alignment quality is not important, and our TTVS is rule-based. Their HMM test is similar to our UASR, because both are without vocabulary, both target a time-aligned phoneme string to be converted to facial parameters, and our ASR is HMM-based. Their ANN system is very close to our direct conversion except for the training set: theirs is audio from a standard speech database and video of a rule-based calculated trajectory, while our system is trained on an actual recording of a professional lip-speaker. However, the results concerning intelligibility are close to each other, see Table 3.4. This is a validation of the results, since the corresponding measurements are close to each other. It is important to note that [27] tests only intelligibility, and only counterparts of three of our methods, so our measurement is broader.
3.1.5 Conclusion
I presented a comparative study of audio-to-visual speech conversion methods. We have presented a comparison of our direct conversion system with conceptually different conversion solutions. A subset of the results correlates with already published results, validating the approach of the comparison.
We observe a higher importance of synchrony over phoneme precision in an ASR-based ATVS system. There are publications on the high impact of correct timing in different aspects [34, 37, 38], but our results show explicitly that more accurate timing achieves a much better subjective evaluation than a more accurate phoneme sequence. Also, we have shown that in the aspect of subjective naturalness evaluation, direct conversion (trained on professional lip-speaker articulation) is a method which produces the highest opinion score, 95.9% of that of an original facial motion recording, with lower computational complexity than ASR-based solutions.
For tasks where intelligibility is important (support for the hearing impaired, visual information in noisy environments), modular ATVS is the best approach among those presented. Our mission of aiding hearing-impaired people calls upon us to consider using ASR-based components. For naturalness (animation, entertainment applications), direct conversion is a good choice. For both aspects, UASR gives relatively good but not outstanding results.
3.1.6 Technical details
Marker tracking was done for MPEG-4 FPs 8.8, 8.4, 8.6, 8.1, 8.5, 8.3, 8.7, 8.2, 5.2, 9.2, 9.3, 9.1, 5.1, 2.10, 2.1. During synthesis, all FAPs (MPEG-4 Facial Animation Parameters) connected to these FPs were used, except depth information:

• raise_r_cornerlip
• lower_t_midlip_o
• raise_b_midlip_o
• raise_r_cornerlip_o

The inner lip contour is estimated from the outer markers.
Yellow paint was used to mark the FP locations on the face of the recorded lip-speaker. The video recording is 576i PAL (576x720 pixels, 25 frames/s, 24 bits/pixel). The audio recording is mono, 48 kHz, 16 bit, made in a silent room. Further conversions depended on the actual method.
Marker tracking was based on color matching and intensity localization from frame to frame, and the location was identified by the region. In overlapping regions the closest location on the previous frame was used to identify the marker. A frame with a neutral face was selected to use as the reference for FAPU measurement. The marker on the nose is used as a reference to eliminate head motion.
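A minimal sketch of this tracking scheme is given below. The marker color, tolerance value, and function names are assumptions; the real tracker also exploits intensity localization and region identity.

```python
# Minimal sketch of color-based marker tracking: threshold pixels near the
# marker color, cluster them into candidate regions, and resolve overlaps by
# choosing the candidate closest to the marker's previous-frame position.
import numpy as np
from scipy import ndimage

def track_marker(frame_rgb, prev_pos, target_rgb=(220, 200, 40), tol=60):
    """frame_rgb: HxWx3 uint8 image; prev_pos: (row, col) from previous frame.
    Returns the centroid of the candidate region nearest to prev_pos."""
    dist = np.linalg.norm(frame_rgb.astype(float) - np.array(target_rgb), axis=2)
    mask = dist < tol                      # pixels matching the marker color
    labels, n = ndimage.label(mask)        # connected candidate regions
    if n == 0:
        return prev_pos                    # lost marker: keep previous position
    centroids = ndimage.center_of_mass(mask, labels, range(1, n + 1))
    d = [np.hypot(r - prev_pos[0], c - prev_pos[1]) for r, c in centroids]
    return centroids[int(np.argmin(d))]
```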
The direct conversion uses a modification of Davide Anguita's Matrix Backpropagation, which also enables real-time operation. The neural network used an 11-frame-long window on the input side (5 frames into the past and 5 frames into the future), and 4 principal component weights of FAP on the output. Each frame of the input is represented by a 16-band MFCC feature vector. The training set of the system contains standalone words and phonetically balanced sentences.
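The data layout of the direct mapping can be sketched as follows; the network itself is a Matrix Backpropagation variant, so the windowing code below only illustrates the input/output shapes stated above.

```python
# Sketch of the direct mapping's data layout, assuming a plain feed-forward
# network. Shapes follow the text: 11 frames x 16 MFCC bands in, 4 FAP-PCA
# weights out.
import numpy as np

def make_windows(mfcc, past=5, future=5):
    """mfcc: (T, 16) array of per-frame MFCC vectors. Returns (T', 11*16)
    inputs, one flattened 11-frame context window per predictable frame."""
    T = mfcc.shape[0]
    rows = [mfcc[t - past:t + future + 1].ravel()
            for t in range(past, T - future)]
    return np.array(rows)

# Example: X has shape (T-10, 176); a trained network maps each row to a
# 4-dimensional FAP-PCA weight vector, which is then projected back to FAPs.
X = make_windows(np.random.randn(4250, 16))
```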
In the ASR, the speech signal was converted to a sampling frequency of 16 kHz. MFCC (Mel Frequency Cepstral Coefficient)-based feature vectors were computed with delta and delta-delta components (39 dimensions in total). The recognition was performed on a batch of separated samples. Output annotations and the samples were joined, and the synchrony between labels and the signal was checked manually.
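A sketch of this 39-dimensional front end using librosa is shown below. The actual feature extraction is part of the VOXerver system, so the frame and filterbank settings here are assumptions.

```python
# Illustrative 39-dimensional MFCC + delta + delta-delta front end.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)     # resample to 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 static coefficients
d1 = librosa.feature.delta(mfcc)                    # delta
d2 = librosa.feature.delta(mfcc, order=2)           # delta-delta
features = np.vstack([mfcc, d1, d2])                # (39, n_frames)
```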
The visemes for the linear interpolation method were selected manually, for each viseme in Hungarian, from the training set of the direct conversion. Visemes and phonemes were assigned by a table. Each segment is a linear interpolation from the actual viseme to the next one. The linear interpolation was calculated in the FAP representation.
The TTVS is a system implemented in Visual Basic with a spreadsheet of the timed phonetic data. This spreadsheet was changed to the ASR output. Neighborhood-dependent dominance properties were calculated and viseme ratios were extracted. Linear interpolation, restrictions concerning biological boundaries, and median filtering were applied, in this order. The output is a Poser data file which is applied to a model. The texture of the model is modified to black skin and differently colored MPEG-4 FP location markers. The animation was rendered in draft mode, with the field of view and resolution of the original recording. Marker tracking was performed as described above, with the exception of the differently colored markers. FAPU values were measured in the rendered pixel space, and FAP values were calculated from the FAPU and tracked marker positions.
This was done for both ASR runs, uninformed and informed.
The test material was manually segmented into 2-4 word units. The lengths of the units were around 3 seconds. The segmentation boundaries were listed and the video cutting was done automatically with an Avisynth script. We used an MPEG-4 compatible head model renderer plugin for Avisynth, with the model "Alice" of the XFace project. The viewpoint and the field of view were adjusted to have only the mouth on the screen, in frontal view.
During the test the subjects watched the videos full screen and used headphones.
3.2 Thesis
I. I showed that the direct AV mapping method, which is computationally more efficient than modular approaches, outperforms modular AV mapping in the aspect of naturalness with a specific training set of a professional lip-speaker. [39]
3.2.1 Novelty
This is the first direct AV mapping system trained with data of a professional lip-speaker. The comparison to modular methods is interesting because direct AV mappings trained on low-quality articulation can easily be outperformed by modular systems in the aspects of naturalness and intelligibility.
3.2.2 Measurements
Naturalness was measured as subjective similarity to human articulation. The measurement was blind and randomized, the number of test subjects was 58, and our direct AV mapping was not significantly worse than original visual speech, while the difference between the modular systems and the original was significant.
Opinion score averages and deviations showed no significant difference between human articulation and direct conversion, but a significant difference between human articulation and modular mapping based systems.
The measurement was done on a Hungarian database of fluently read speech. The database contains mixed isolated words and sentences.
3.2.3 Limits of validity
Tests were done on a normal speech database, with the perception of the test subjects fully focused on videos of good audio and video quality.
3.2.4 Consequences
Using direct conversion for areas where naturalness is most important is encouraged. Using a professional lip-speaker to record the audiovisual database increases the quality to be comparable with the level of human articulation. Other laboratories trained their systems with non-professionals, and those systems were not published due to their poor performance.
Figure 3.5: An example of the importance of correct timing. Frames of the word "Oktober" show timing differences between methods. Note that direct conversion received the best score even though it does not close the lips on the bilabial but closes on the velar, and it has problems with lip rounding.
Chapter 4

Temporal asymmetry
In this chapter I discuss the measurement of the relevant time window for direct AV mapping, which is important for building an audio-to-visual speech conversion system, since the temporal window of interest can be determined.
4.1 Method
The fine temporal structure of the relations of acoustic and visual features has been investigated to improve our speech-to-face conversion system. The mutual information of acoustic and visual features has been calculated with different time shifts. The result has shown that the movement of feature points on the face of professional lip-speakers can precede the changes of the acoustic parameters of the speech signal by as much as 100 ms. Considering this time variation, the quality of speech-to-face-animation conversion can be improved by using future speech sound in the conversion.
4.1.1 Introduction
Other research projects on the conversion of speech audio signals to facial animation have concentrated on the development of feature extraction methods, database construction and system training [40, 41]. Evaluation and comparison of different systems have also had high importance in the literature. In this chapter I discuss the temporal integration of acoustic features optimal for real-time conversion to facial animation. The critical part of such systems is the building of an optimal statistical model for the calculation of the video features from the audio features. There is currently no known exact relation between the audio feature set and the video feature set; this is still an open question.
The speech signal conveys information elements in a very specific way. Some speech sounds are related rather to a steady state of the articulatory organs, others rather to the transition movements [42]. Our target application is to provide a communication aid for deaf people. Professional lip-speakers speak at a rate of 5-6 phonemes/s to adapt the communication to the demands of deaf people, so the steady-state phases and the transition phases of speech sounds are longer than in an everyday speech style.
The signal features that characterize the steady-state phase of a sound, a transition phase, or even a coarticulation phenomenon when the neighboring sounds are highly interrelated, need a careful selection of the temporal scope used to characterize the speech and video signals. In our model we selected 5 analysis windows: the window describing the actual frame of speech plus two preceding and two succeeding windows, covering a +/-80 ms interval. Such a 5-element sequence of speech parameters can characterize transient sounds and coarticulations.
We have recognized that at the beginning of words the lip movements start earlier than the sound production. Sometimes the lips start to move to the initial position of the sounds 100 ms earlier. It was the task of the statistical model to handle this phenomenon.
In the refinement phase of our system we have tried to optimize the model by selecting the optimal temporal scope and fitting of audio and video features. The measure of the fitting was based on the mutual information of audio and video features [43].
The base system uses an adjustable temporal window of the audio speech signal. The neural network can be trained to respond to an array of MFCC windows, using future and/or past audio data. The conversion can only be as good as the amount of mutual information between the audio and video representations allows.
Using the trained neural net for calculation of control parameters of the facial animation model
The audio processing unit extracts the audio MFCC feature vectors from the input speech signal. Five frames of MFCC vectors are used as input to the trained neural net. The NN provides FacePCA weight vectors. These are converted into the control parameters of an MPEG-4 standard face animation model. The test of the fitting of audio and video features was based on step-by-step temporal shifting of the feature vectors. The indicator of the matching was mutual information. Low mutual information means that we have a low average chance to estimate the facial parameters from the audio feature set. The time shift value producing the highest mutual information gives the maximal average chance to calculate one kind of known features from the other.
The estimation of mutual information needs a computation-intensive algorithm. The calculation is unrealistic using a large database with multidimensional feature vectors, so single MFCC-PCA and FacePCA parameters were interrelated. Since the single parameters are orthogonal but not independent, they are not additive. For example, the FacePCA1 values are not independent from FacePCA2. The mutual information curves even in such complex cases can indicate the interrelations of parameters.
An alternative method is to calculate cross-correlation. We have also tested this method. It needs less computational power, but some relations are not indicated, so it is a lower estimate of the theoretical maximum.
Mutual information is high if knowing X helps to find out what Y is, and it is low if X and Y are independent. To use this measurement for the temporal scope, the audio signal will be shifted in time compared to the video. If the time-shifted signal still has high mutual information, it means that this time value should be in the temporal scope. If the time shift is too high, the mutual information between the video and the time-shifted audio will be low due to the relative independence of different phonemes.
Using a and v as audio and video frames:

\[
MI(\Delta t) = \sum_{t=1}^{n} P(a_{t+\Delta t}, v_t)\,\log\frac{P(a_{t+\Delta t}, v_t)}{P(a_{t+\Delta t})\,P(v_t)}, \qquad \forall\,\Delta t \in [-1\,\mathrm{s}, 1\,\mathrm{s}] \tag{4.2}
\]

where P(x, y) is estimated by a 2-dimensional histogram convolved with a Gaussian window. The Gaussian window is needed to simulate the continuous space in the histogram in cases where only a few observations are available. Since the audio and video data are multidimensional and MI works with one-dimensional data, all the coefficient vectors were processed and the results were summarized. The mutual information values have been estimated from 200x200-size joint distribution histograms. The histograms have been smoothed by a Gaussian window. The window has a 10-cell radius with a 2.5-cell deviation. The marginal density distribution functions have been calculated from the sums of the joint distribution functions.
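The estimate described above can be sketched directly from the definition: build the 200x200 joint histogram, smooth it with a Gaussian window, derive the marginals from the smoothed joint distribution, and scan the time shift. The sigma value (taken as the stated 2.5-cell deviation) and the function names below are illustrative.

```python
# Minimal sketch of the histogram-based MI estimate and the time-shift scan.
# a, v are single coefficient trajectories (one MFCC-PCA and one FacePCA
# component), sampled at the video frame rate.
import numpy as np
from scipy.ndimage import gaussian_filter

def mutual_information(a, v, bins=200, sigma=2.5):
    """Smoothed-histogram MI estimate for two equal-length 1-D trajectories."""
    joint, _, _ = np.histogram2d(a, v, bins=bins)
    joint = gaussian_filter(joint, sigma=sigma)   # simulate continuous space
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)         # marginal of a
    pv = joint.sum(axis=0, keepdims=True)         # marginal of v
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pv)[nz])).sum())

def mi_curve(a, v, fps=25, max_shift_s=1.0):
    """MI as a function of the audio-vs-video time shift, in seconds."""
    m = int(max_shift_s * fps)
    return [(s / fps,
             mutual_information(a[max(s, 0):len(a) + min(s, 0)],
                                v[max(-s, 0):len(v) + min(-s, 0)]))
            for s in range(-m, m + 1)]
```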