Audio to visual speech conversion

Gergely Feldhoffer

A thesis submitted for the degree of Doctor of Philosophy

Scientific adviser: György Takács

Faculty of Information Technology
Interdisciplinary Technical Sciences Doctoral School
Pázmány Péter Catholic University

Budapest, 2010
Contents

1 Introduction ... 3
  1.1 Definitions ... 3
    1.1.1 Components ... 4
    1.1.2 Quality issues ... 5
  1.2 Applications ... 6
    1.2.1 Synface ... 6
    1.2.2 Synthesis of nonverbal components of visual speech ... 7
    1.2.3 Expressive visual speech ... 7
    1.2.4 Speech recognition today ... 7
    1.2.5 MPEG-4 ... 8
    1.2.6 Face rendering ... 8
  1.3 Open questions ... 9
  1.4 The proposed approach of the thesis ... 9
  1.5 Related disciplines ... 10
    1.5.1 Speech inversion ... 10
    1.5.2 Computer graphics ... 10
    1.5.3 Phonetics ... 10

2 Motivation and the base system ... 11
  2.1 SINOSZ project ... 11
    2.1.1 A practical view ... 11
  2.2 Lucia ... 12
  2.3 The base system ... 12
    2.3.1 Database building from video data ... 12
    2.3.2 Audio ... 12
    2.3.3 Video ... 14
    2.3.4 Training ... 14
    2.3.5 First results ... 16
    2.3.6 Discussion ... 16
  2.4 Johnnie Talker ... 17
  2.5 Extending direct conversion ... 17
    2.5.1 Direct ATVS and co-articulation ... 17
    2.5.2 Evaluation ... 19

3 Naturalness of direct conversion ... 21
  3.1 Method ... 21
    3.1.1 Introduction ... 21
    3.1.2 Audio-to-visual conversion ... 21
    3.1.3 Evaluation ... 25
    3.1.4 Results ... 26
    3.1.5 Conclusion ... 29
    3.1.6 Technical details ... 30
  3.2 Thesis ... 32
    3.2.1 Novelty ... 32
    3.2.2 Measurements ... 32
    3.2.3 Limits of validity ... 32
    3.2.4 Consequences ... 32

4 Temporal asymmetry ... 35
  4.1 Method ... 35
    4.1.1 Introduction ... 35
    4.1.2 Results and conclusions ... 40
    4.1.3 Multichannel Mutual Information estimation ... 44
    4.1.4 Duration of asymmetry ... 46
  4.2 Thesis ... 47
    4.2.1 Novelty ... 47
    4.2.2 Measurements ... 47
    4.2.3 Limits of validity ... 48
    4.2.4 Consequences ... 48

5 Speaker independence in direct conversion ... 51
  5.1 Method ... 51
    5.1.1 Introduction ... 51
    5.1.2 Speaker independence ... 53
    5.1.3 Conclusion ... 55
  5.2 Thesis ... 56
    5.2.1 Novelty ... 56
    5.2.2 Measurements ... 56
    5.2.3 Limits of validity ... 56
    5.2.4 Consequences ... 56

6 Visual speech in audio transmitting telepresence applications ... 57
  6.1 Method ... 57
    6.1.1 Introduction ... 58
    6.1.2 Overview ... 58
    6.1.3 Face model ... 59
    6.1.4 Viseme based decomposition ... 60
    6.1.5 Voice representation ... 63
    6.1.6 Speex coding ... 63
    6.1.7 Neural network training ... 64
    6.1.8 Implementation issues ... 65
    6.1.9 Results ... 65
    6.1.10 Conclusion ... 68
  6.2 Thesis ... 68
    6.2.1 Novelty ... 70
    6.2.2 Measurements ... 70
    6.2.3 Consequences ... 70
Acknowledgements

I would like to thank my supervisor György Takács for his help, and my current and former colleagues Attila Tihanyi, Tamás Bárdi, Tamás Harczos, Bálint Srancsik, and Balázs Oroszi. I am thankful to my doctoral school for providing the tools and the caring environment for my work, personally especially to Judit Nyéky-Gaizler and Tamás Roska.

I am also thankful to the students Iván Hegedűs, Gergely Jung, János Víg, Máté Tóth, Gábor Dániel "Szasza" Szabó, Balázs Bányai, László Mészáros, Szilvia Kovács, Solt Bucsi Szabó, Attila Krebsz and Márton Selmeci, who participated in our research group.

My work would have been less without the discussions with visual speech synthesis experts such as László Czap, Takaaki Kuratate and Sascha Fagel.

I would also like to thank my fellow PhD students and friends, especially Béla Weiss, Gergely Soós, Ádám Rák, Zoltán Fodróczi, Gaurav Gandhi, György Cserey, Róbert Wágner, Csaba Benedek, Barnabás Hegyi, Éva Bankó, Kristóf Iván, Gábor Pohl, Bálint Sass, Márton Miháltz, Ferenc Lombai, Norbert Bérci, Ákos Tar, József Veres, András Kiss, Dávid Tisza, Péter Vizi, Balázs Varga, László Füredi, Bence Bálint, László Laki, László Lővei and József Mihalicza for their valuable comments and discussions.

I thank Mrs Vida, Lívia Adorján, Mrs Haraszti, Mrs Körmendy, Gabriella Rumi, Mrs Tihanyi and Mrs Mikesy for their endless patience and helpfulness. I also thank the technical staff of the university for their support, especially Péter Tholt, Tamás Csillag and Tamás Rec.

And last but not least I would like to thank the patient and loving support of my wife Bernadett and my family.
Abstract

In this thesis, I propose new results in audio speech based visual speech synthesis, which can be used as an aid for hard of hearing people or in computer aided animation. I will describe a synthesis tool which is based on direct conversion between the audio and video modalities. I will discuss the properties of this system, measuring the speech quality, and give solutions for its drawbacks. I will show that using an adequate training strategy is critical for direct conversion. At the end I conclude that direct conversion can perform as well as other popular audio to visual speech conversions, and that it is currently ignored undeservedly because of the lack of efficient training.
Chapter 1

Introduction

Audio to visual speech conversion is an increasingly popular applied research field today. Main conferences such as Interspeech or EURASIP events have started new sections concerning multimodal speech processing; Interspeech 2008 held a special session devoted to audio to visual speech conversion.

Possible applications of the field are communication aiding tools for deaf and hard of hearing people [1], taking advantage of the sophisticated lip-reading capabilities of these people, and lip-sync applications in the animation industry, in computer aided animation as well as in real-time telepresence based video games. In this thesis I will describe solutions for both of these applications.

In this chapter I will present the current status of the topic, the motivations, and the state of the art techniques. To understand this chapter, basic speech and signal processing knowledge is needed.
1.1 Definitions

Speech is a multimodal process. The modalities can be classified as audio speech and visual speech. I will use the following terms:

Visual speech is a representation of the view of a talking face.

Visual speech data is the motion information of the visible speech organs in any representation.

Phoneme is the basic meaning-distinctive segmental unit of audio speech. It is language dependent.

Viseme is the basic meaning-distinctive segmental unit of visual speech. It is also language dependent. There are visemes belonging to phonemes, and there are phonemes which have no particular viseme, because the phoneme can be pronounced with more than one way of articulation.

Automatic speech recognition (ASR) is a system or a method which can extract phonetic information from the audio speech signal. Usually a phoneme string is produced.

Audio to visual speech (ATVS) conversion systems create an animation of a face according to a given audio speech.
Figure 1.1: Task of audio to visual speech (ATVS) conversion.
Direct ATVS is an ATVS which maps the audio representation to the video representation by approximation.

Discrete ATVS is an ATVS which uses classification into discrete categories in order to connect the modalities. Usually phonemes and visemes are used.

Modular ATVS is an ATVS which contains an ASR subsystem and a phoneme-viseme mapping subsystem. Modular ATVS systems are usually discrete.

AV mapping is an input-output method where the input is audio data in any representation and the output is visual data in any representation. In a discrete ATVS, the AV mapping is a phoneme-viseme mapping; in a direct ATVS it is an approximator.

1.1.1 Components
Each ATVS consists of an audio preprocessor, the AV (audio to video) mapping, and a face synthesizer. The most straightforward method is the jaw opening driven by speech energy; this system is widely used in on-line games. In this case the audio preprocessor is a frame-by-frame energy calculation expressed in dB, and the AV mapping is a linear function which maps the one dimensional audio data to the one dimensional video parameter, the jaw opening. The face model is usually a vertex array of the face, and the face synthesis is done by modifying the vertices of the jaw. Below, more sophisticated cases will be detailed where naturalness and intelligibility are issues.
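As a minimal sketch of this energy-driven baseline (the function names and the dB thresholds are illustrative choices, not taken from any particular game engine):

```python
import numpy as np

def frame_energy_db(frame, eps=1e-10):
    """Mean-square energy of one audio frame, expressed in dB."""
    return 10.0 * np.log10(np.mean(frame.astype(float) ** 2) + eps)

def jaw_opening(energy_db, silence_db=-60.0, loud_db=-10.0):
    """Linear AV mapping: clamp the frame energy between an assumed
    silence level and an assumed loud level, and scale it to a
    one dimensional jaw-opening parameter in [0, 1]."""
    t = (energy_db - silence_db) / (loud_db - silence_db)
    return float(np.clip(t, 0.0, 1.0))
```

Per frame, the resulting scalar would directly displace the jaw vertices of the face model.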
Recent research activities focus on speech signal processing methods specially for lip-readable face animation [2], face representation and controller methods [3], and convincingly natural facial animation systems [4].
Audio preprocessing

These systems use feature extraction methods to get useful and compact information from the speech signal. The most important aspects of quality here are the dimensionality of the extracted representation and the covering error. For example, the spectrum can be approximated by a few channels of mel bands, replacing the speech spectrum with a certain error. In this case the dimensionality is reduced greatly by allowing a certain noise in the represented data. Databases for neural networks have to consider dimensionality as a primary aspect.

Audio preprocessing methods can be clustered in many aspects, as time domain or frequency domain feature extractors, approximation or classification, etc. A deeper analysis of audio preprocessing methods concerning audiovisual speech is published in [5], with the result that the main approaches perform approximately equally well. These traditional approaches are the Mel Frequency Cepstral Coefficients (MFCC) and Linear Prediction Coding (LPC) based methods. A quite convenient property of LPC based vocal tract estimation is the direct connection to the speech organs via the pipe excitation model. It seems to be a good idea to use the vocal tract for ATVS as well, but according to [5] it does not carry significantly more usable data.
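To make the LPC option concrete, a textbook Levinson-Durbin recursion (a sketch of the standard algorithm, not the preprocessing code used in the thesis) estimates the all-pole vocal tract filter of a frame from its autocorrelation:

```python
import numpy as np

def lpc(frame, order=10):
    """Estimate LPC coefficients a[0..order] (a[0] = 1) of an all-pole
    vocal tract model via the Levinson-Durbin recursion.
    Returns the coefficients and the residual (excitation) energy."""
    n = len(frame)
    # autocorrelation r[0..order]
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient from the current prediction error
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err
        a[1:i] = a[1:i] + k * a[1:i][::-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err
```

The coefficients relate to the pipe excitation model mentioned above: the reflection coefficients correspond to area ratios of a lossless tube.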
AV mapping

In this step the modalities are connected: visual speech data is produced from audio data.

There are different strategies for performing this audio to visual conversion. One approach is to exploit automatic speech recognition (ASR) to extract phonetic information from the acoustic signal. This is then used in conjunction with a set of coarticulation rules to interpolate a visemic representation of the phonemes [6, 7]. Alternatively, a second approach is to extract features from the acoustic signal and convert directly from these features to visual speech [8, 9].
Face synthesis

In this step the visual speech representation is applied to a face model. It is usually separated into two independent parts: facial animation representation and model rendering. The face representation maps the quantitative data on face descriptors. An example of this is the MPEG-4 standard. Face rendering is a method to produce a picture or an animation from face descriptions. These are usually computer graphics related techniques.

1.1.2 Quality issues

An ATVS can be evaluated on different aspects:
- naturalness: how similar the result of the ATVS is to a real person's visual speech

- intelligibility: how much the result helps the lip-reader to understand the content of the speech

- complexity: the system's overall time and space complexity

- trainability: how easy it is to enhance the system's other qualities by examples, whether this process is fast or slow, and whether it is adaptable or fixed

- speaker dependency: how the system performance varies between different speakers

- context dependency: how the system performance varies between speech contents (e.g. a system which was trained on medical content may perform poorer on financial content)

- language dependency: how complex it is to port the system to a different language. The replacement of the database can be enough, or rules may have to be changed, or the very possibility can be questionable.

- acoustical robustness: how the system performance varies in different acoustic environments, such as higher noise.
1.2 Applications

In this section I describe some of the recent systems, with a short characterization in the quality aspects detailed above.
1.2.1 Synface

An example of ATVS systems is the Synface [1] of KTH, Sweden. This system is designed for hearing impaired but not deaf people to handle voice calls on the phone. The system connects the phone line to a computer, where a speech recognition software translates the incoming speech signal to a time aligned phoneme sequence. This phoneme sequence is the basis of the animation control. Each phoneme is assigned to a viseme, and the recognized sequence makes a string of visemes to animate. The speech recognition subsystem not only recognizes the phonemes but performs the segmentation as well. The viseme sequence timed by this segmentation information gives the final result of the AV mapping, using a rule-based strategy. The rule set is created from examples of Swedish multimodal speech.

This system is definitely language dependent: it uses the Swedish phoneme set, a Swedish ASR, and a rule set built on Swedish examples. On the other hand the system performs very well in the aspects of intelligibility, acoustical robustness, and speaker and context dependency.
1.2.2 Synthesis of nonverbal components of visual speech

An example of audio to visual non-verbal speech estimation is the system of Gregor Hofer and Hiroshi Shimodaira [9]. Their system targets the extraction of the correct time of blinks in speech. The audio preprocessing in this system concentrates on non-verbal components, such as rhythm and intonation. Compared to actual videos, the original audio signal was used to test the precision of the estimation, which was above 80% with a decent time toleration of 100 ms. It is important that there are two kinds of blink: the fast blink of regular eye care, and the emphasized blink which is a non-verbal visual speech component. Of course this work focused on the second variant.
1.2.3 Expressive visual speech

This field changed its name from "emotional speech" to "expressive speech" for psychological reasons. Expressive speech research targets the synthesis or recognition of emotional expressions in speech. Expressing emotions is very relevant in visual speech.

I show two approaches to the field. Pietro Cosi et al work on the virtual head "Lucia" [10] to connect an expressive audio speech synthesizer with a visual speech synthesizer. This text based system can be used as an audiovisual agent on any interactive medium where text can be used. For expressive visual speech it uses visemes for the textual content and four basic emotional states of the face as an expressive speech basis. They work on a natural blending function of these states.

Sascha Fagel works on expressive speech in a broad sense [11]. He created a method to help creating expressive audiovisual databases by leading the subject through emotional stages to reach the desired level of expression gradually. This way it is possible to record emotionally neutral content (e.g. "It was on Friday") articulated with joy or anger. The trick is to record the sentence multiple times and insert emotionally relevant content between the occurrences. One example could be the sequence "Trouble always happens with me! It was on Friday. What do you think you are?! It was on Friday. I hate you! It was on Friday." This method gives the speaker a guide to express anger which gradually increases in expressiveness. The database will contain only the occurrences of the emotionally neutral content.
1.2.4 Speech recognition today

As of 2010, after a decade, the hegemony of Hidden Markov Model (HMM) based ASR systems [12] is still standing. This approach uses a language model formulated with consecutiveness functions, and a pronunciation model with confusion functions.

The main reason for the popularity of HMM based ASR systems is the efficiency of handling many thousands of words in a formal grammar. This grammar can be used to focus the vocabulary around a specific topic to increase the correctness and reduce the time complexity. An HMM can be trained to a specific speaker, but it can also be trained on large databases to work speaker independently.
Figure 1.2: Mouth focused subset of feature points of MPEG-4.
1.2.5 MPEG-4

MPEG-4 is a standard for face description for communication. It uses Feature Points (FP) and Facial Animation Parameters (FAP) to describe the state of the face. The constant properties of a face can also be expressed in MPEG-4, for example the sizes in FAPU (Facial Animation Parameter Units).

The usage of MPEG-4 is typical in multimedia applications where an interactive or highly compressed pre-recorded behavior of a face is needed, such as video games or news agents. One of the most popular MPEG-4 systems is Facegen [13].

For visual speech synthesis MPEG-4 is a fair choice, since there are plenty of implementations and resources. The degree of freedom around the mouth is close to the actual needs, but there are features which cannot be modeled with MPEG-4, such as inflation. Gerard Bailly et al showed that using more feature points around the mouth can increase naturalness significantly [14].
1.2.6 Face rendering

The task of synthesizing the picture from face descriptors is face rendering. Usually 3D engines are used, but 2D systems can also be found. The spectrum of approaches and image qualities is very wide, from efficient simple implementations to muscle based simulations [4].

Most of the face renderers use 3D acceleration and vertex arrays to interpolate, which is a well accelerated operation on today's video cards. In this case a few given vertex arrays represent given phases of the face, and using interpolation techniques, the state of the face can be expressed as a weighted sum of the proper vertex arrays. The resulting state can be textured and lighted just as one of the originally designed facial phases.
1.3 Open questions

It is clear that face modeling and facial animation (subtasks of audio to visual speech conversion) are still evolving, but these are mainly development fields. There are, however, open research areas, such as the approximation of the human skin's physical properties, the connection of the modalities, the evaluation of AV mapping, what is speaker dependent in the articulation, and what the minimal necessary degrees of freedom are for perfect facial modeling.

My motivations cover the applied research on the connection between the modalities. This is also an open question. There are convenient arguments for the physical relation between the modalities: the speech organs are used both for audio and visual speech, although some of them are not visible. There must be physical effects of the visible speech organs on the audio speech.

On the other hand, there are phenomena where the connection between the modalities is minimal. Speech disorders can affect the audio speech without a visible trace. Ventriloquism (the art of speaking without lip movement, usually performed with puppets creating the illusion of a speaking puppet) is also an interesting exception.

To avoid inconsistency I turned to the clarified topic of audio to visual speech conversion.
1.4 The proposed approach of the thesis

The physical connection between the modalities can give a guideline to reach basic conversion from audio to video, but this goal is not clear without specified quality aspects. The next chapter will detail how our research group met the field through aiding deaf and hard of hearing people. Their main quality aspects of the resulting visual speech are lip-readability and naturalness. This way the problem can be redefined as searching for the most appropriate visual speech for the given audio speech signal, not restoring the original visual articulation.

The physical connection can be utilized easily through direct conversion. Direct ATVS systems are not speech recognition systems; the target is to produce an animation without recognizing any of the language layers such as phonemes or words, as this part of the process is left to the lip-reader. Because of this, our ATVS uses no phoneme recognition; furthermore there is no classification part in the process. This is the direct ATVS, avoiding any discrete type of data in the process. Discrete ATVS systems use visemes as the visual match of phonemes to describe a given state of the animation of a phoneme, and use interpolation between them to produce coarticulation.

One of the most important benefits of direct conversion is the chance to conserve the nonverbal content of the speech, such as prosody, dynamics and rhythm. Modular ATVS systems have to synthesize these features to maintain the naturalness of the result.
1.5 Related disciplines

1.5.1 Speech inversion

Our task is similar to speech inversion, which tends to extract information from the speech signal about the state sequence of the speech organs. However, speech inversion aims to reproduce every speech organ in exactly the same state as the speaker used his organs, with every speaker dependent property [15, 16]. ATVS is different: the target is to produce a lip-readable animation which depends only on the visible speech organs and does not depend on the speaker dependent features of the speech signal.

Speech inversion aims to recover the state sequence of the speech organs from speech. A very simple model and solution of this problem is the vocal tract. Recent research in this field concerns tongue motion and models in particular.

1.5.2 Computer graphics

Synthesis of the human face is a challenging field of computer graphics. The main reason for the high difficulty is the very sensitive human observer. Humankind developed a highly sophisticated communication system with facial expressions; it is a basic human skill to identify emotional and contextual content from a face. An example of cutting edge face synthesis systems is the rendering system of the movie Avatar (2009), where the system parameters were extracted from actors [17]. There are recent scientific results on efficient volume conserving deformations of facial skin based on muscular modeling [4]. These modern rendering methods can reproduce the creasing of the face, which is perceptually important.

1.5.3 Phonetics

The science of phonetics is related to ATVS systems through the ASR based approaches. Phonetically interesting areas are the ASR component, the phoneme string processing, and the rules applied on phoneme strings to synthesize visual speech, such as interpolation rules or dominance rules.

The details of articulation, and the relation of the phonetic content and the facial muscle controls, is the topic of articulatory phonetics [18, 19]. This field classifies phonemes by their places of articulation: labial-dental, coronal, dorsal, glottal. ATVS systems are aware of the visible speech organs, so labial-dental consonants are important, along with vowels and articulations with open mouth. For example the phoneme "l" is identifiable by its alveolar articulation, since it is articulated with opened mouth.

Articulatory phonetics has important results for ATVS systems, as we will see in the details of visual speech synthesis from phoneme strings.
Chapter 2

Motivation and the base system

In this chapter I will describe the main tasks I had to deal with, showing the motivation of my thesis. I will describe a base system as well. The base system itself is not part of the contribution of my thesis, although understanding the base system is important to understanding my motivations.
2.1 SINOSZ project

The original project with SINOSZ (the Hungarian National Association of the Deaf and Hearing Impaired) aimed at a mobile system to help hard of hearing people deal with audio-only information sources. The first idea was to visualize the audio data in some learnable representation, but the association rejected any visualization technique which must be learned by the deaf community, so the visualization method had to be some already known representation of the speech. We had basically two choices: to implement an ASR to get text, or to translate to facial motion. We expected more efficient and robust quality from facial motion conversion with the capabilities of a mobile device in 2004.

The development of the mobile application was initiated with the project. The mobile branch of the project is out of the scope of my thesis, although the requirement of efficiency is important.
2.1.1 A practical view

When I started to work on audio to visual speech conversion, after examining some of the systems in the aspects of requirements and qualities detailed in the previous chapter, I decided to use direct conversion. The main reason at that time was to get a functional and efficient test system as soon as possible, to have results and first hand experience, with the hope of a sufficiently efficient implementation later.

Direct conversion can be deployed on mobile platforms more easily than database dependent classifier systems. Not only is the computational time moderate, but the memory requirements are also lower. Choosing direct conversion was the option with the guaranteed possibility of a test implementation on the target platform.
2.2 Lucia

In the beginning of the project I convinced the team to use direct mapping between the modalities. My two important reasons were the efficiency and the lack of the requirement of a labeled database, unlike for an ASR. Since we did not have any audiovisual databases (and even in 2009 there are quite few publicly available) we had to think about not only the system but the database also. Direct conversion does not need labeled data, so manual work can be minimized, which shortens the production time.

So the planned system contained a simple audio preprocessing (LPC or MFCC), a direct mapping from audio feature vectors to video via code-book or neural network, and visualization of the result on an artificial head. We did not have any face models, neither did we want to create one, so we were looking for an available head model.

The first test system used the talking head of Cosi et al [10] called Lucia. The head model was originally used for expressive speech synthesis. The system used MPEG-4 FAP as input and generated a run-time video in an OpenGL window; exporting to video files was also available.
2.3 The base system

The base system is an implementation of direct conversion (see Figure 2.1).

2.3.1 Database building from video data

Direct conversion needs pairs of audio and video data, so the database should be a (maybe labeled) audiovisual speech recording where the visual information is enough to synthesize a head model. Therefore we recorded a face with markers on a subset of the MPEG-4 FP positions, mostly around the mouth and jaw, and also some reference points. Basically this is a preprocessed multimedia material, specially prepared to be used as a training set for neural networks. For this purpose the data should not contain strong redundancy, to allow a practically acceptable learning speed, so the pre-processing includes the choice of an appropriate representation as well. With an inadequate representation the learning may take months, or may not even converge.
2.3.2 Audio

The voice signal is processed at a 25 frame/s rate to be in synchrony with the processed video signal. One analysis window is 20-40 ms; the number of samples in the 40 ms window is chosen to be at most 2^n samples. The input speech can be pre-emphasis filtered with H(z) = 1 - 0.983 z^-1. A Hamming window and FFT with the Radix-2 algorithm are applied. The FFT spectrum is converted to 16 mel-scale bands, and logarithm and DCT are applied. Such Mel Frequency Cepstrum Coefficient (MFCC) feature vectors are commonly used in general speech recognition tasks. The MFCC feature vectors provide the input to the neural networks after scaling to [-0.9..0.9].
Figure 2.1: Workflow used in Lucia.
2.3.3 Video

For video processing we used two methods. Both methods are based on video recordings of a speaker and feature tracker applications. The first method is based on markers only, which are placed around the mouth. The markers were selected as a subset of the MPEG-4 face description standard. Tracking the markers is a computer aided process; a 98% precise marker tracker algorithm was developed for this phase. The mistakes were corrected manually. The marker positions as a function of time were the raw data, which was normalized by control points such as the nose to eliminate the motion of the whole head. This gives a 30-36 dimensional space depending on the marker count. This data is very redundant and high dimensional, and not suitable for neural network training, so PCA was applied to reduce the dimensionality and eliminate the redundancy. The PCA can be treated as a lossy compression, because only the first 6 parameters were used for training. Using only 6 coefficients can cause about 1 pixel error on a PAL screen, which is the precision of the marker tracking. The first 4 components can be seen in Fig 2.2.
The base system's video database is a set of video records of professional lip-speakers. Their moving faces are described by a 15 element subset of the standard MPEG-4 feature point (FP) set (84 points). These feature points were marked by colored dots on the faces of the speakers. The coordinates of the feature points were calculated by a marker tracking algorithm.
The marker tracking algorithm used the number of markers (nm) as input, and on each frame it looked for the nm most marker-like areas of the picture. The marker-likeliness was given as a high energy fixed size blob after yellow filtering. The tracking contained a self-check: by looking for additional markers and comparing the marker-likelinesses of the markers [.., nm-1, nm, nm+1, ..], a good tracking shows a strong decrease after the nm-th marker. If the decrease is before nm there are missing markers; if the decrease is after nm there are misleading blobs in the frame. Using this self-check of the tracking, manual corrections were made.
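The decision part of this self-check can be sketched as follows (the blob scoring itself is omitted; names and the "largest drop" criterion are my reading of the description above):

```python
def check_markers(scores, nm):
    """Self-check of the marker tracker.  `scores` are marker-likeliness
    values of the nm expected markers plus a few extra candidates,
    sorted in decreasing order.  A good frame shows its strongest drop
    right after the nm-th score; an earlier drop suggests missing
    markers, a later one suggests misleading blobs."""
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    cut = max(range(len(drops)), key=drops.__getitem__) + 1  # candidates kept before the drop
    if cut == nm:
        return "ok"
    return "missing markers" if cut < nm else "misleading blobs"
```

Frames flagged as anything but "ok" would be queued for the manual correction pass.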
The FP coordinates mean 30 dimensional vectors, which are compressed by PCA. We have realized that the first few PCA basis vectors have close relations to the basic movement components of the lips. Such components can differentiate visemes. The marker coordinates are transformed into this basis, and we can use the transformation weights as data (FacePCA). The FacePCA vectors are the target output values of the neural net during the training phase [8].
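A minimal sketch of this FacePCA step (SVD-based PCA; the thesis does not specify the implementation, and the function names are mine):

```python
import numpy as np

def face_pca(coords, n_keep=6):
    """Compress normalized marker coordinates (n_frames x 30) with PCA.
    Returns the mean pose, the first n_keep basis vectors, and the
    per-frame weights used as neural-network training targets."""
    mean = coords.mean(axis=0)
    centered = coords - mean
    # rows of vt are the principal directions, ordered by variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_keep]
    weights = centered @ basis.T        # the FacePCA vectors
    return mean, basis, weights

def reconstruct(mean, basis, weights):
    """Map FacePCA weights back to marker coordinates for animation."""
    return mean + weights @ basis
```

At synthesis time the network predicts the 6 weights and `reconstruct` turns them back into feature point positions.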
2.3.4 Training

The synchrony of the audio and video data is checked by the word "papapa" at the beginning and the end of the recording. The first opening of the mouth by this bilabial can be synchronized with the burst in the audio data. This synchronization guarantees that the pairs of audio and video data were recorded at the same time. For the best result the neural network has to be trained on multiple windows of audio feature vectors, where the window count has to be chosen based on the optimal temporal scope.
Figure 2.2: Principal components of MPEG-4 feature points in the database of a professional lip-speaker.
The neural network is a back-propagation implementation by Davide Anguita called Matrix Back Propagation [20]. This is a very efficient software; we use a slightly modified version of the system to be able to continue a training session.
2.3.5 First results

The described modules were implemented and trained. The system was measured with a recognition test with deaf people. To simulate a measurable communication situation, the test covered numbers and the names of the days of the week and the months. As the measurement aimed to tell the difference between the ATVS and a real person's video, the situation had to take average lip-reading cases into consideration. As we found in [8], deaf persons rely upon context more than hearing people. In the cases of numbers or names of months the context clearly defines the class of the word but leaves the actual value uncertain. During the test the subjects had to recognize 70 words from video clips. One third of the clips were original video clips from the recording of the database, another third were output of the ATVS from audio signals, and the remaining third were synthesized video clips from the extracted video data. The difference between the recognition of the real recording and the face animation from the extracted video data gives the recognition error from the face model and the database, while the difference between animations from video data and from audio data gives the quality of the audio to video conversion. Table 2.1 shows the results.
Table 2.1: Recognition rates of different video clips.

  Material                    Recognition rate
  original video              97%
  face model on video data    55%
  face model on audio data    48%
2.3.6 Discussion

In this case the 48% should be compared to the 55%. The face model driven by recorded visual speech data is the best possible behavior of the direct ATVS system. The ratio of the results is 87%. It means that the base system lets deaf people understand 87% of what the best possible system would give. This was a very encouraging experience.

The 55% is a weak result compared to the 97% of the original video. This falloff is because of the artificial head. This ratio could be enhanced by using more sophisticated head models and facial parameters, but that direction of research and development is out of the scope of this thesis.
2.4 Johnnie Talker

Johnnie Talker is a real-time system with very low time complexity, a pure AV mapping application with a simple face model and facial animation. Johnnie was implemented to demonstrate the low time complexity of the direct ATVS approach. The application is a development of our research group; it uses my implementation of the AV mapping, Tamás Bárdi's audio preprocessing and Bálint Srancsik's OpenGL based head model.

Johnnie is freely downloadable from the webpage of the author of this dissertation [21]. It is a Windows application using OpenGL.

Because of the demand for low latency, no theoretical delay was used in this system. In the next few chapters I will describe how the naturalness and the intelligibility can be enhanced by using a time window into the future of the audio modality. This can be implemented by delaying the audio dub to maintain audio-video synchrony while using the future audio at the same time. For example a phone line can be delayed by the theoretically optimal time window. But since Johnnie can be used via microphone, which cannot be delayed, any additional buffering would cause noticeable latency which ruins the subjective sense of quality. As one of the next chapters will describe, subjective quality evaluation depends heavily on audio-video synchrony, and this phenomenon appears strongly in the perception of a synthesized visual speech of one's own speech in real time.

Johnnie Talker was shown at various international conferences with success. It was a good opportunity to test language independence in practice.

We were looking for techniques to improve the qualities of the real-time system without additional run-time overhead. There will be a chapter about a method which can enhance the speaker independence of the system using only database modifications, so no run-time penalty is needed.
2.5 Extending direct conversion

2.5.1 Direct ATVS and co-articulation

The most common form of language is personal talk, which is an audiovisual speech process. Our research is focused on the relation of the audio and the visual part of talking, to build a system converting the voice signal into face animation.

Co-articulation is the phenomenon of transient phases in the speech process. In the audio modality, co-articulation is the effect of the neighboring phonemes on the actual state of speech in a short window of time, shorter than a phoneme duration. In speech synthesis, there is a strong demand to create natural transients between the clean states of speech. In visual speech synthesis this issue is also important. In the visual speech process there are visemes even if the synthesizer does not explicitly use this concept. Visual co-articulation can be defined as a system of influences between visemes in time. Because of biological limitations, visual co-articulation is slower than audio co-articulation, but similar in other ways: neighboring visemes can have an effect on each other, there are stronger visemes than others, and most of the cases can be described or approximated as an interpolation of neighboring visemes.

Let me call a system a visual speech transient model if it generates intermediate states of visual speech units, such as visemes. An example of a visual speech transient model is the co-articulation concept strictly adopted on visemes, the visual co-articulation, since the viseme string processing has to decide how interpolation should take place between the visemes. Another example of visual speech transient models is the direct conversion's adaptation to longer time windows in order to include more than one phoneme of the audio modality. In this case the transients depend on acoustical properties. In modular ATVS systems, the transients are coded in rules depending on viseme string neighborhoods.
Utilization
Training a direct ATVS needs audio-video data pairs. Since plenty of speech audio databases exist but only a few audiovisual ones, building a direct ATVS means building a multimodal database first. A discrete ATVS is a modular system; it is possible to use existing speech databases to train voice recognition, and separately train the animation part on phoneme pairs or trigraphs [1]. A direct ATVS therefore needs a special database, but the system will handle energy and rhythm naturally, while a discrete ATVS has to reassemble the phonemes into a fluid co-articulation chain of viseme interpolations. Let us use the term "temporal scope" for the overall time span of a co-articulation phenomenon, meaning that the state of the mouth depends on this time interval of the speech signal. In a direct ATVS the calculation of a frame is based on this audio signal interval. In a discrete ATVS the visemes and the phonemes are synchronized and interpolation is applied between them, as is popular in text-to-visual-speech systems [22]. Figure 2.3 shows this difference.
Figure 2.3: Temporal scope of discrete (interpolating) and direct ATVS
Asymmetry
As the mutual information estimation showed, any given state of the video data stream can be calculated fairly well from a definable relative time window of the speech signal. This model predicts that the transient phase of visible speech can be calculated in the same way as the steady phase, as Figure 2.3 shows.
This model gives a prediction about temporal asymmetries in the multimodal speech process. This asymmetry can be explained by mental predictivity in the motion of the facial muscles to fluently form the next phoneme. Details will follow in the chapter "Temporal asymmetry".

Speaker independence
Since the direct conversion is usually an approximation trained on a given set of audio and video states, it suffers from heavy dependence on the database. As I detailed before, a good direct ATVS needs a good lip-speaker to share visual data with the system. Talented lip-speakers are rare, and most of the experienced lip-speakers are women. This means that a single recording of one lip-speaker gives a speaker-dependent system, and collecting more professional lip-speakers would result in a gender-dependent system, since the statistics of the data would be heavily biased: it is very difficult to collect enough male lip-speakers for the system.
Even if we had plenty of professional lip-speakers, there is a question about mixing the video data. People articulate differently. It is not guaranteed that a mixture of good articulations results in even an acceptable articulation. The safest solution is to choose one of the lip-speakers as a guaranteed high-quality articulation, and to try to use his or her performance with multiple voices.
I will give a solution for this problem in the chapter "Speaker independence in direct conversion".
2.5.2 Evaluation
The base system was published as a standalone system, and was measured with subjective opinion scores and intelligibility tests with deaf persons. In my chapter about the comparison of AV mappings, I will position the direct ATVS among the others used in the world.
Oddly, there are quite few publications on direct ATVS. This is strange, because the system is one of the simplest designs. Let me tell a personal experience from a EUSIPCO conference in Florence. A young researcher was interested in our Johnnie demo. He was from ATR, Japan, and he praised our system. As I explained the workflow of the system, at each stage he said "We did the same". Even the number of PCA coefficients was the same. At the end, he said that their system produced significantly worse results than ours; it was not even published because it was flawed. We agreed then that the most important difference is the lip-speaker's professionality.
His work can be read in Japanese [23] in the annual report of the institute; incidentally, in the same year we published our results in Hungarian [24, 25, 26].
Another example of direct conversion publicity is a comparison study in which an unpublished direct system [27] is used as an internal baseline.
Because of this scarcity of publications, it is important to position direct ATVS among the more popular modular ATVS systems, since most visual speech synthesis research groups also try to implement a direct ATVS, but their efforts fail because of the quality of the database. At first glance this may seem like bad news for our research, but the novelty of our system is unharmed, since the work at ATR was identical only in the technical details, and a training system's technology by itself, without the database, is not a whole system. Our base system is new because of the new training data and the finding of the need for a professional lip-speaker. Again, I would like to emphasize that the difference between our base system and the one developed at ATR is not "only" the database but the training strategy, which is one of the most important and fundamental parts of any learning system.
This new and successful learning strategy makes our base system novel, but in this thesis I focus on the results of my own work, not the research group's. Johnnie Talker is a contribution of the group, and the following extensions and measurements are contributions of the author of this thesis.
In the next chapter I will show how the base system, with the essential database of the professional lip-speaker, can be ranked among the widely used ATVS systems.
Chapter 3

Naturalness of direct conversion
In this chapter I discuss the measurement of the naturalness of synthetic visual speech, and the comparison of different AV mapping approaches.
3.1 Method
A comparative study of audio-to-visual speech conversion is described in this chapter. The direct feature-based conversion approach is compared to various indirect ASR-based solutions. The already detailed base system was used as the direct conversion. The ASR-based solutions are the most sophisticated systems currently available for Hungarian. The methods are tested in the same environment in terms of audio pre-processing and facial motion visualization. Subjective opinion scores show that with respect to naturalness, direct conversion performs well. Conversely, with respect to intelligibility, ASR-based systems perform better.
The thesis about the results of the comparison is important because no AV mapping comparisons had been done before with the novel training database of a professional lip-speaker.
3.1.1 Introduction
A difficulty that arises in comparing the different approaches is that they are usually developed and tested independently by the respective research groups. Different metrics are used, e.g. intelligibility tests and/or opinion scores, and different data and viewers are applied [28]. In this chapter I describe a comparative evaluation of different AV mapping approaches within the same workflow, see Figure 3.1. The performance of each is measured in terms of intelligibility, where lip-readability is measured, and naturalness, where a comparison with real visual speech is made.
3.1.2 Audio-to-visual Conversion
The performance of five different approaches will be evaluated. These are summarized as follows:
Figure3.1: Multipleconversion methodsweretestedinthesameenvironment
Figure 3.2: Structure of direct conversion.

- A reference based on natural facial motion.
- A direct conversion system.
- An ASR-based system that linearly interpolates phonemic/visemic targets.
- An informed ASR-based approach that has access to the vocabulary of the test material (IASR).
- An uninformed ASR (UASR) that does not have access to the text vocabulary.

These are described in more detail in the following sections.
Direct conversion
We used our base system, with a database of a professional lip-speaker. The length of the recorded speech was 4250 frames.
ASR-based conversion
For the ASR-based approaches a Weighted Finite State Transducer and Hidden Markov Model (WFST-HMM) decoder is used. Specifically, a system known as VOXerver [29] is used, which can run in one of two modes: informed, which exploits knowledge of the vocabulary of the test data, and uninformed, which does not. Incoming speech is converted to MFCCs, after which blind channel equalization is used to reduce linear distortion in the cepstral domain [30]. Speaker-independent cross-word decision-tree based triphone acoustic models are applied, previously trained using the MRBA Hungarian speech database [31], a standardized, phonetically balanced Hungarian speech database developed at the Budapest University of Technology and Economics.
The uninformed ASR system uses a phoneme-bigram phonotactic model to constrain the decoding process. The phoneme-bigram probabilities were estimated from the MRBA database. In the informed ASR system a zerogram word language model is used with a vocabulary size of 120 words. Word pronunciations were determined automatically as described in [32].
In both types of speech recognition approaches the WFST-HMM recognition network was constructed offline using the AT&T FSM toolkit [33]. In the case of the informed system, phoneme labels were projected to the output of the transducer instead of word labels. The precision of the segmentation is 10 ms.
Viseme interpolation
To compare the direct and indirect audio-to-visual conversion systems, a standard approach for generating visual parameters is to first convert a phoneme to its equivalent viseme via a lookup table, then linearly interpolate the viseme targets. This approach to synthesizing facial motion is oversimplified because coarticulation effects are ignored, but it does provide a baseline on expected performance (worst-case scenario).
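As a concrete illustration, this baseline can be sketched as a lookup table plus piecewise-linear blending between viseme targets placed at segment midpoints. The phoneme set, target values and midpoint placement below are illustrative assumptions, not the actual tables used in the experiments:

```python
import numpy as np

# Hypothetical phoneme -> viseme target table; each target is a small
# vector of facial parameters (here: jaw opening, lip-corner stretch).
VISEME_TARGETS = {
    "p": np.array([0.0, 0.0]),   # bilabial: closed mouth
    "a": np.array([0.8, 0.3]),   # open vowel
    "o": np.array([0.5, -0.2]),  # rounded vowel
}

def linear_viseme_track(segments, frame_ms=40.0):
    """Baseline ATVS: place each viseme target at its segment midpoint
    and linearly interpolate between successive targets, ignoring
    coarticulation. segments: list of (phoneme, start_ms, end_ms)."""
    mids = [(s + e) / 2.0 for _, s, e in segments]
    targets = [VISEME_TARGETS[p] for p, _, _ in segments]
    frames = []
    t = 0.0
    while t <= segments[-1][2]:
        if t <= mids[0]:
            frames.append(targets[0])
        elif t >= mids[-1]:
            frames.append(targets[-1])
        else:
            i = max(j for j in range(len(mids)) if mids[j] <= t)
            w = (t - mids[i]) / (mids[i + 1] - mids[i])
            frames.append((1.0 - w) * targets[i] + w * targets[i + 1])
        t += frame_ms
    return np.stack(frames)

# One 40 ms frame track for a timed phoneme string /p a o/:
track = linear_viseme_track([("p", 0, 80), ("a", 80, 240), ("o", 240, 400)])
```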
Modular ATVS
To account for coarticulation effects, a more sophisticated interpolation scheme is required. In particular, the relative dominance of neighboring speech segments on the articulators is needed. Speech segments can be classified as dominant, uncertain or mixed according to the level of influence exerted on the local neighborhood. To learn the dominance functions, an ellipsoid is fitted to the lips of speakers in a video sequence articulating Hungarian triphones. To aid the fitting, the speakers wear a distinctly colored lipstick. Dominance functions are estimated from the variance of visual data in a given phonetic neighborhood set. The learned dominance functions are used to interpolate between the visual targets derived from the ASR output [34]. We use the implementation of László Czap and János Mátyás here, which produces Poser script. FAPs are extracted from this format by the same workflow as from an original recording.
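The dominance idea can be sketched as a weighted blend in which each segment's influence decays with temporal distance. The exponential dominance shape and all numbers below are illustrative assumptions (the system above estimates dominance from the variance of visual data; exponential kernels of this shape are used in the Cohen-Massaro tradition of coarticulation modeling):

```python
import math

def dominance(dt_ms, strength, decay=0.01):
    # Exponentially decaying dominance of a segment at temporal distance
    # dt_ms from its centre. The shape is a Cohen-Massaro style
    # assumption, not the variance-estimated functions described above.
    return strength * math.exp(-decay * abs(dt_ms))

def blend(t_ms, segments):
    """Dominance-weighted average of viseme targets at time t_ms.
    segments: list of (target_value, centre_ms, strength)."""
    weights = [dominance(t_ms - c, s) for _, c, s in segments]
    return sum(w * v for w, (v, _, _) in zip(weights, segments)) / sum(weights)

# A dominant bilabial closure (strength 3) between two open vowels
# pulls the mouth almost shut at its centre:
segs = [(0.8, 0.0, 1.0), (0.0, 100.0, 3.0), (0.8, 200.0, 1.0)]
mouth_at_closure = blend(100.0, segs)
```

Unlike plain linear interpolation, a strong segment keeps pulling the trajectory toward its target even inside its neighbors' time spans.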
Figure 3.3: Modular ATVS consists of an ASR subsystem and a text-to-visual-speech subsystem.
Rendering Module
The visualization of the output of the ATVS methods is common to all approaches. The output from the ATVS modules is a set of facial animation parameters (FAPs), which are applied to a common head model for all approaches. Note that although better facial descriptors than MPEG-4 are available, MPEG-4 is used here because our motion capture system does not provide more detail than this. The rendered video sequences are created from these FAP sequences using the Avisynth [35] 3D face renderer. As the main components of the framework are common between the different approaches, any differences are due to the differences in the AV mapping methods. Actual frames are shown in Fig. 3.5.
3.1.3 Evaluation
Implementation-specific noncritical behavior (e.g. articulation amplitude) should be normalized to ensure that the comparison is between the essential qualities of the methods. To discover these differences, a preliminary test was done.
Preliminary test
To tune the parameters of the systems, 7 videos were generated by each of the five mapping methods, and some sequences were re-synthesized from the original facial motion data. All sequences started and ended with a closed mouth, and each contained between 2-4 words. The speaker who participated in all of the tests was not one of those who were involved in training the audio-to-visual mapping. The videos were presented in a randomized order to 34 viewers, who were asked to rate the quality of the systems using an opinion score (1-5). The results are shown in Table 3.1.

Table 3.1: Results of preliminary tests used to tune the system parameters. Shown are the average and standard deviation of scores.

Method     Average score   STD
UASR       3.82            0.33
Original   3.79            0.24
Linear     3.17            0.4
Direct     3.02            0.41
IASR       2.85            0.72
The results were unexpected; the IASR, which uses a more sophisticated coarticulation model, was expected to be one of the best performing systems. Closer investigation of the lower scores showed that the reason was the poorer audiovisual synchrony of IASR compared to UASR. The reason for this phenomenon is the difference in the mechanisms of the informed and the uninformed speech recognition processes. During informed recognition the timing information is produced as a consequence of the alignment of the correct phonemes to the signal, which forces the segment boundaries by using the certain phonetic information. The uninformed recognition may miscategorize a phoneme, but the acoustical changes are the drivers of the segment boundaries, so the resulting segmentation is closer to the acoustically reasonable one than the phonetically driven segmentation.
A qualitative difference between the direct and indirect approaches is the degree of mouth opening: the direct approach tended to open the mouth on average 30% more than the indirect approaches. Consequently, to bring the systems into the same dynamic range, the mouth opening for the direct mapping was damped by 30%. The synchrony of the ASR-based approaches was checked for systematic errors (constant or linearly increasing delays) using cross-correlation of locally time-shifted windows, but no systematic patterns of errors were detected.
3.1.4 Results

ASR subsystem
The quality of the ASR-based approach is affected by the recognized phoneme string. This is typically 100% correct for the informed system, as the test set consists only of a small number of words (months of the year, days of the week, and numbers under 100), whilst the uninformed system has a typical error rate of 25.21%. Despite this, the ATVS using this input performs surprisingly well. The likely reason is the pattern of confusions: often phonemes that are confused acoustically appear visually similar on the lips. A second factor that affects the performance of the ASR-based approaches is the precision of the segmentation. Generally the uninformed systems are more precise on the average than the informed systems. The precision of the segmentation can severely impact the subjective opinion scores. We therefore first attempt to quantify these likely sources of error.

Figure 3.4: Trajectory plot (FAP value versus time in ms) of the different methods (Direct, IASR, UASR, Linear) for the word "Hatvanhárom" (hOtvOnha:rom). Jaw opening and lip opening width are shown. Note that the speaker did not pronounce the utterance perfectly, and the informed system attempts to force a match with the correctly recognized word. This leads to time alignment problems.
The informed recognition system is similar in nature to forced alignment in standard ASR tasks. For each utterance the recognizer is run in forced alignment mode for all of the vocabulary entries. The main difference between the informed and the uninformed recognition process is the different Markov state graphs used for recognition. The informed system uses a zerogram without loopback, while the uninformed graph is a bigram model graph where the probabilities of the connections depend on language statistics.
While matching the extracted features with the Markovian states, the differences are accumulated in both scenarios. However, the uninformed system allows for different phonemes outside of the vocabulary to minimize the accumulated error. For the informed system only the most likely sequence is allowed, which can distort the segmentation; see Figure 3.4 for an example where the speaker mispronounces the word "Hatvanhárom" (hOtvOnha:rom, "63" in Hungarian). The (mis)segmentation of OtvO means the IASR ATVS system opens the mouth after the onset of the vowel. Human perception is sensitive to this error, so it severely impacts the perceived quality. Without forcing the vocabulary, a system may ignore one of the consonants but open the mouth at the correct time.
Note that the generalization of this phenomenon is out of the scope of this work. We have demonstrated that this is a problem with certain implementations of HMM-based ASR. Alternative, more robust implementations might alleviate these problems.
Table 3.2: Results of opinion scores, average and standard deviation.

Method                   Average score   STD
Original facial motion   3.73            1.01
Direct conversion        3.58            0.97
UASR                     3.43            1.08
Linear interpolation     2.73            1.12
IASR                     2.67            1.29
Subjective opinion scores
The test setup is similar to the preliminary test described previously to tune the system. However, 58 viewers were used, and only a quantitative opinion survey was made, on a scale of 1 (bad, very artificial) to 5 (real speech).
The results of the opinion score test are shown in Table 3.2. The advantage of direct conversion over UASR is on the edge of significance with p=0.0512, as is the difference between the original speech and the direct conversion with p=0.06, but UASR is significantly worse than original speech with p=0.00029. Compared to the preliminary test, the results also show that with respect to naturalness, the excessive articulation is not significant. The advantage of correct timing over a correct phoneme string is also significant.
Note that the linear interpolation system exploits better quality ASR results, but still performs significantly worse than the average of the other ASR-based approaches. This shows the importance of correctly handling viseme dominance and viseme neighborhood sensitivity in ASR-based ATVS systems.
Intelligibility
Intelligibility was measured with a test of recognition of video sequences without sound. This is not the popular Modified Rhyme Test [36], but for our purposes with hearing-impaired viewers it is more relevant, since keyword spotting is the most common lip-reading task. The 58 test subjects had to guess which word was said from a given set of 5 other words of the same category. The categories were numbers, names of months and the days of the week. All the words were said twice. The sets were intervals, to eliminate the memory test from the task (for example "2", "3", "4", "5", "6" can be a set). This task models the situation of hearing impairment or a very noisy environment where an ATVS system can be used. It is assumed that the context is known, so keyword spotting is the closest task to the problem.
The performance of the audio-to-visual speech conversion methods reverses in this task compared to naturalness. The main result here is the dominance of ASR-based approaches (Table 3.3), and the insignificance of the difference between informed and uninformed ATVS results (p=0.43) in this test, which may deserve further investigation. Note that as synchrony is not an issue without voice, the IASR is the best.
Table 3.3: Results of recognition tests, average and standard deviation of success rate in percent. A random pick would give 20%.

Method              Precision   STD
IASR                61%         20%
UASR                57%         22%
Original motion     53%         18%
Cartoon             44%         11%
Direct conversion   36%         27%
Table 3.4: Comparison to the results of Öhman and Salvi [27], an intelligibility test of HMM- and rule-based systems. Intelligibility of corresponding methods is similar.

Methods        Prec.   Prec.
IASR/Ideal     61%     64%
UASR/HMM       57%     54%
Direct/ANN     36%     34%
As a comparison with [27], where intelligibility is tested similarly: their manually tuned optimal rule-based facial parameters are close to our IASR, since there was no recognition error, without voice the time alignment quality is not important, and our TTVS is rule based. Their HMM test is similar to our UASR, because both are without vocabulary, both target a time-aligned phoneme string to be converted to facial parameters, and our ASR is HMM based. Their ANN system is very close to our direct conversion except for the training set: theirs uses audio from a standard speech database and rule-based calculated trajectory video data, while our system is trained on an actual recording of a professional lip-speaker. The results concerning intelligibility are nevertheless close to each other, see Table 3.4. This is a validation of the results, since the corresponding measurements are close to each other. It is important to note that [27] tests only intelligibility, and only three methods corresponding to ours, so our measurement is broader.
3.1.5 Conclusion
I presented a comparative study of audio-to-visual speech conversion methods. We have presented a comparison of our direct conversion system with conceptually different conversion solutions. A subset of the results correlates with already published results, validating the approach of the comparison.
We observe a higher importance of synchrony over phoneme precision in an ASR-based ATVS system. There are publications on the high impact of correct timing in different aspects [34, 37, 38], but our results show explicitly that more accurate timing achieves much better subjective evaluation than a more accurate phoneme sequence. Also, we have shown that in the aspect of subjective naturalness evaluation, direct conversion (trained on professional lip-speaker articulation) is a method which produces the highest opinion score, 95.9% of that of an original facial motion recording, with lower computational complexity than ASR-based solutions.
For tasks where intelligibility is important (support for the hearing impaired, visual information in noisy environments), modular ATVS is the best approach among those presented. Our mission of aiding hearing-impaired people calls upon us to consider using ASR-based components. For naturalness (animation, entertainment applications) direct conversion is a good choice. For both aspects UASR gives relatively good but not outstanding results.
3.1.6 Technical details
Marker tracking was done for MPEG-4 FPs 8.8, 8.4, 8.6, 8.1, 8.5, 8.3, 8.7, 8.2, 5.2, 9.2, 9.3, 9.1, 5.1, 2.10, 2.1. During synthesis, all FAPs (MPEG-4 Facial Animation Parameters) connected to these FPs were used except depth information:
• open_jaw
• lower_t_midlip
• raise_b_midlip
• stretch_l_cornerlip
• stretch_r_cornerlip
• lower_t_lip_lm
• lower_t_lip_rm
• raise_b_lip_lm
• raise_b_lip_rm
• raise_l_cornerlip
• raise_r_cornerlip
• lower_t_midlip_o
• raise_b_midlip_o
• stretch_l_cornerlip_o
• stretch_r_cornerlip_o
• lower_t_lip_lm_o
• lower_t_lip_rm_o
• raise_b_lip_lm_o
• raise_b_lip_rm_o
• raise_l_cornerlip_o
• raise_r_cornerlip_o

The inner lip contour is estimated from the outer markers.
Yellow paint was used to mark the FP locations on the face of the recorded lip-speaker. The video recording is 576i PAL (576x720 pixels, 25 frames/sec, 24 bits/pixel). The audio recording is mono, 48 kHz, 16 bit, made in a silent room. Further conversions depended on the actual method.
Marker tracking was based on color matching and intensity localization frame to frame, and the location was identified by the region. In overlapping regions the closest location on the previous frame was used to identify the marker. A frame with a neutral face was selected as the reference for the FAPU measurement. The marker on the nose is used as a reference to eliminate head motion.
The direct conversion uses a modification of Davide Anguita's Matrix Back Propagation which also enables real-time operation. The neural network used an 11-frame-long window on the input side (5 frames into the past and 5 frames into the future), and 4 principal component weights of FAP on the output. Each frame on the input is represented by a 16-band MFCC feature vector. The training set of the system contains standalone words and phonetically balanced sentences.
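A minimal sketch of this input/output arrangement follows, with an untrained stand-in network (random weights) and a clamping edge policy that is my assumption, since the edge handling at sequence boundaries is not spelled out here:

```python
import numpy as np

rng = np.random.default_rng(0)
N_MFCC, N_OUT, WIN = 16, 4, 5    # 16-band MFCC, 4 FacePCA weights, +-5 frames

def input_window(mfcc, t, win=WIN):
    """Stack frames t-win .. t+win (11 frames) into one input vector,
    clamping indices at the sequence edges (an assumed edge policy)."""
    idx = np.clip(np.arange(t - win, t + win + 1), 0, len(mfcc) - 1)
    return mfcc[idx].ravel()     # shape (11 * 16,) = (176,)

# Untrained stand-in for the trained backpropagation network: a single
# hidden layer mapping the 176-dim window to 4 FacePCA weights.
W1 = rng.normal(scale=0.1, size=(176, 32))
W2 = rng.normal(scale=0.1, size=(32, N_OUT))

def face_pca_weights(mfcc, t):
    h = np.tanh(input_window(mfcc, t) @ W1)
    return h @ W2                # 4 FacePCA weights for frame t

mfcc = rng.normal(size=(100, N_MFCC))   # 100 frames of 16-band MFCC
out = face_pca_weights(mfcc, 50)
```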
In the ASR the speech signal was converted to a sampling frequency of 16 kHz. MFCC (Mel Frequency Cepstral Coefficient) based feature vectors were computed with delta and delta-delta components (39 dimensions in total). The recognition was performed on a batch of separated samples. Output annotations and the samples were joined, and the synchrony between labels and the signal was checked manually.
The visemes for the linear interpolation method were selected manually for each viseme in Hungarian from the training set of the direct conversion. Visemes and phonemes were assigned by a table. Each segment is a linear interpolation from the actual viseme to the next one. Linear interpolation was calculated in the FAP representation.
The TTVS is a system implemented in Visual Basic with a spreadsheet of the timed phonetic data. This spreadsheet was replaced by the ASR output. Neighborhood-dependent dominance properties were calculated and viseme ratios were extracted. Linear interpolation, restrictions concerning biological boundaries, and median filtering were applied in this order. The output is a Poser data file which is applied to a model. The texture of the model is modified to black skin with differently colored MPEG-4 FP location markers. The animation was rendered in draft mode, with the field of view and resolution of the original recording. Marker tracking was performed as described above, with the exception of the differently colored markers. FAPU values were measured in the rendered pixel space, and FAP values were calculated from FAPU and tracked marker positions.
This was done for both ASR runs, uninformed and informed.
The test material was manually segmented into 2-4 word units. The lengths of the units were around 3 seconds. The segmentation boundaries were listed and the video cutting was done automatically with an Avisynth script. We used an MPEG-4 compatible head model renderer plugin for Avisynth, with the model "Alice" of the XFace project. The viewpoint and the field of view were adjusted to have only the mouth on the screen in frontal view.
During the test the subjects watched the videos full screen and used headphones.
3.2 Thesis
I. I showed that the direct AV mapping method, which is computationally more efficient than modular approaches, outperforms the modular AV mapping in the aspect of naturalness with a specific training set of a professional lip-speaker. [39]
3.2.1 Novelty
This is the first direct AV mapping system trained with data of a professional lip-speaker. The comparison to modular methods is interesting because direct AV mappings trained on low-quality articulation can easily be outperformed by modular systems in the aspects of naturalness and intelligibility.
3.2.2 Measurements
Naturalness was measured as subjective similarity to human articulation. The measurement was blind and randomized, the number of test subjects was 58, and our direct AV mapping was not significantly worse than the original visual speech, while the difference between the modular systems and the original was significant.
Opinion score averages and deviations showed no significant difference between human articulation and direct conversion, but a significant difference between human articulation and modular mapping based systems.
The measurement was done on a Hungarian database of fluently read speech. The database contains mixed isolated words and sentences.
3.2.3 Limits of validity
Tests were done on a normal speech database, with the perception of the test subjects fully focused on videos of good audio and video quality.
3.2.4 Consequences
Using direct conversion for areas where naturalness is most important is encouraged. Using a professional lip-speaker to record the audiovisual database increases the quality to be comparable with the level of human articulation. Other laboratories trained their systems with non-professionals, and those systems were not published due to their poor performance.
Figure 3.5: An example of the importance of correct timing. Frames of the word "Oktober" show timing differences between methods. Note that direct conversion received the best score even though it does not close the lips on the bilabial but closes on the velar, and it has problems with lip rounding.
Chapter 4

Temporal asymmetry
In this chapter I discuss the measurement of the relevant time window for direct AV mapping, which is important for building an audio-to-visual speech conversion system, since the temporal window of interest can be determined.
4.1 Method
The fine temporal structure of the relations of acoustic and visual features has been investigated to improve our speech-to-face conversion system. The mutual information of acoustic and visual features has been calculated with different time shifts. The results have shown that the movement of feature points on the face of professional lip-speakers can precede the changes of the acoustic parameters of the speech signal by as much as 100 ms. Considering this time variation, the quality of speech-to-face-animation conversion can be improved by using the future speech sound in the conversion.
4.1.1 Introduction
Other research projects on the conversion of speech audio signal to facial animation have concentrated on the development of feature extraction methods, database construction and system training [40, 41]. Evaluation and comparison of different systems have also had high importance in the literature. In this chapter I discuss the temporal integration of acoustic features optimal for real-time conversion to facial animation. The critical part of such systems is the building of an optimal statistical model for the calculation of the video features from the audio features. There is currently no known exact relation between the audio feature set and the video feature set; this is still an open question.
The speech signal conveys information elements in a very specific way. Some speech sounds are related rather to a steady state of the articulatory organs, others rather to the transition movements [42]. Our target application is providing a communication aid to deaf people. Professional lip-speakers have a speech rate of 5-6 phonemes/s to adapt the communication to the demands of deaf people, so the steady-state phases and the transition phases of speech sounds are longer than in everyday speech style.
The signal features that characterize a steady-state phase of a sound, a transition phase, or even a co-articulation phenomenon where the neighboring sounds are highly interrelated, need a careful selection of the temporal scope used to characterize the speech and video signals. In our model we selected 5 analysis windows to describe the actual frame of speech: the current window plus two preceding and two succeeding windows, covering a +/- 80 ms interval. Such a 5-element sequence of speech parameters can characterize transient sounds and co-articulations.
We have recognized that at the beginning of words the lip movements start earlier than the sound production. Sometimes the lips start to move to the initial position of the sounds 100 ms earlier. It was the task of the statistical model to handle this phenomenon.
In the refinement phase of our system we have tried to optimize the model by selecting the optimal temporal scope and fitting of audio and video features. The measure of the fitting has been based on the mutual information of audio and video features [43].
The base system uses an adjustable temporal window of the audio speech signal. The neural network can be trained to respond to an array of MFCC windows, using future and/or past audio data. The conversion can only be as good as the amount of mutual information between the audio and video representations allows.
Using the trained neural net for calculation of control parameters of the facial animation model
The audio processing unit extracts the audio MFCC feature vectors from the input speech signal. Five frames of MFCC vectors are used as input to the trained neural net. The NN provides FacePCA weight vectors. These are converted into the control parameters of an MPEG-4 standard face animation model. The test of the fitting of audio and video features was based on step-by-step temporal shifting of the feature vectors. The indicator of the matching was mutual information. A low mutual information value means that we have a low average chance to estimate the facial parameters from the audio feature set. The time shift value producing the highest mutual information means the maximal average chance to calculate one kind of known features from the other.
Estimation of mutual information needs a computation-intensive algorithm. The calculation is unrealistic on a large database with multidimensional feature vectors, so single MFCPCA and FacePCA parameters were interrelated. Since the single parameters are orthogonal but not independent, they are not additive; for example the FacePCA1 values are not independent from FacePCA2. The mutual information curves even in such complex cases can indicate the interrelations of parameters.
An alternative method is to calculate cross-correlation. We have also tested this method. It needs less computational power, but some relations are not indicated, so it is a lower estimate of the theoretical maximum.
Mutual information
MI(X,Y) = \sum_{x \in X} \sum_{y \in Y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)}    (4.1)
Mutual information is high if knowing X helps to find out what Y is, and it is low if X and Y are independent. To use this measure for the temporal scope, the audio signal will be shifted in time compared to the video. If the time-shifted signal still has high mutual information, this time value should be in the temporal scope. If the time shift is too large, the mutual information between the video and the time-shifted audio will be low due to the relative independence of different phonemes.
Using a and v as audio and video frames:
\forall \Delta t \in [-1\,\mathrm{s}, +1\,\mathrm{s}]: \quad MI(\Delta t) = \sum_{t=1}^{n} P(a_{t+\Delta t}, v_t) \log \frac{P(a_{t+\Delta t}, v_t)}{P(a_{t+\Delta t}) P(v_t)}    (4.2)

where P(x,y) is estimated by a 2-dimensional histogram convolved with a Gaussian window. The Gaussian window is needed to simulate the continuous space in the histogram in cases where there are only a few observations. Since the audio and video data are multidimensional and MI works with one-dimensional data, all the coefficient vectors were processed and the results are summarized. The mutual information values have been estimated from 200x200-size joint distribution histograms. The histograms have been smoothed by a Gaussian window with a 10-cell radius and 2.5-cell deviation. The marginal density distribution functions have been calculated from the sums of the joint distribution functions.
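This estimator can be sketched at a reduced size as follows (fewer bins than the 200x200 histograms used above; the smoothing radius and deviation follow the values given, and the helper name and synthetic test signals are mine):

```python
import numpy as np

def mutual_information(x, y, bins=50, sigma=2.5, radius=10):
    """Histogram estimate of MI(X;Y) in bits: 2-D joint histogram,
    separable Gaussian smoothing, marginals computed from the
    smoothed joint distribution."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    r = np.arange(-radius, radius + 1)
    g = np.exp(-0.5 * (r / sigma) ** 2)
    g /= g.sum()
    smooth = lambda m: np.convolve(m, g, mode="same")
    joint = np.apply_along_axis(smooth, 0, joint)   # smooth columns
    joint = np.apply_along_axis(smooth, 1, joint)   # smooth rows
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)               # marginals from joint
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (px @ py)[nz])))

# Sanity check on synthetic data: a noisy copy of a signal shares much
# more information with it than an independent signal does.
rng = np.random.default_rng(1)
a = rng.normal(size=5000)
mi_dependent = mutual_information(a, a + 0.1 * rng.normal(size=5000))
mi_independent = mutual_information(a, rng.normal(size=5000))
```

Sweeping such an estimate over time-shifted audio, as in Eq. (4.2), traces out the mutual information curves discussed below.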
MFCPCA and FacePCA measurements
170 seconds of audio and video speech recordings were processed. The time shift has been varied in 1 ms steps. Mel frequency coefficients are calculated for each element. Principal component analysis (PCA) has been applied for an even more compact representation of the audio features, since PCA components can represent the original speech frames with minimal average error at a given subspace dimensionality. In the following, the speech frames are described by such MFCPCA parameters.
The MFCPCA parameters are a more readable representation of frames for human experts than a PCA of MFCC feature vectors.
The MFCPCA parameters have direct relations to the spectrum. The PCA transformation does not consider the sign of the transformed vectors, so the first MFCPCA component shows an energy-like representation, as can be seen in Fig 4.1. As another example, the second MFCPCA component has positive values in voiced speech frames and negative values in frames of fricative speech elements.
The original video records have a 40 ms frame period, so to allow 1 ms step-size shifting, the intermediate shifted frame parameters have been calculated by interpolation and low-pass filtering.
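A minimal sketch of this resampling step, using plain linear interpolation as a stand-in for the interpolation and low-pass filtering described above:

```python
import numpy as np

def upsample_track(fap, frame_ms=40, step_ms=1):
    """Resample a 25 frame/s FAP trajectory to a 1 ms grid by linear
    interpolation, so the audio and video representations can be
    shifted against each other in 1 ms steps."""
    t_src = np.arange(len(fap)) * frame_ms
    t_dst = np.arange(0, t_src[-1] + 1, step_ms)
    return np.interp(t_dst, t_src, fap)

fap = np.array([0.0, 1.0, 0.0, -1.0])   # four 40 ms video frames
fine = upsample_track(fap)               # samples on a 1 ms grid
```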
Figure 4.1: Principal components of MFC feature vectors.
Table 4.1: Importance rate (variance) of the MFCPCA components.

MFCPCA   alone   first n together
1        77%     77%
2        10%     87%
3        5%      93%
4        2%      95%
Table 4.2: Importance rate (variance) of the FacePCA components.

FacePCA   alone   first n together
1         90%     90%
2         6%      96%
3         2%      98%
4         1%      99%
The audio and video signals are described by synchronous frames with a 1 ms fine step size. The signals can be shifted relative to each other by fine steps. The audio and video representations of the speech signal can be interrelated from ∆t = -1000 ms to +1000 ms. Such an interrelation can be investigated only at the level of how well, on average, a single voice element can be estimated from a shifted video element and vice versa.
Our calculation cannot quantify the additional information of the shifted signal compared to the zero-shift value; if there is any such additional information, it is not subtracted. So the curves do not indicate a need to extend the time scope for every non-zero value. Rather, the shape of the curve and the shift value of the maximum have a specific meaning.
In the new coordinate system generated by the principal component analysis, the coordinates can be characterized by the importance rate. The importance rate expresses which portion of the variance of the original space is produced in the given direction. The importance rate values in the case of the MFCPCA transformation are shown in Table 4.1.
The importance rate values in the case of the FacePCA transformation are shown in Table 4.2.
Combining the two tables by multiplication of the two vectors, a common importance estimation can be calculated. The values express the contribution of the parameter pairs to the whole multidimensional data.
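The combination of the two tables is a simple outer product of the two variance-share vectors; the following sketch reproduces it from the values in Tables 4.1 and 4.2 (variable names are illustrative).

```python
import numpy as np

# Variance shares of the first four components, from Tables 4.1 and 4.2
mfcpca_importance  = np.array([0.77, 0.10, 0.05, 0.02])
facepca_importance = np.array([0.90, 0.06, 0.02, 0.01])

# Outer product: weight of each (FacePCA_i, MFCPCA_j) curve pair
pair_importance = np.outer(facepca_importance, mfcpca_importance)
```

The (1st FacePCA, 1st MFCPCA) pair dominates with a weight of 0.90 x 0.77 ≈ 0.69, which is why the curves of the first components are drawn darkest in the figures.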
The really important curves are the combinations of the 1st-4th principal components. Their general importance is expressed by the darkness of the curves. Potential systematic errors have been carefully checked. The real synchrony of the audio-video records has been adjusted based on plosive sounds: the noise burst of plosives and the opening position of the lips are reliably identifiable characteristics. The check has been repeated at the end of the records as well. The possible synchrony error is below one video frame (40 ms).
Figure 4.2: Mutual information of the shifted 1st FacePCA and MFCPCA components. Positive ∆t means future voice. Darkness shows importance.
4.1.2 Results and conclusions
The mutual information curves were calculated and plotted for every possible PCA parameter pair in the range of -1000 to +1000 ms time shift. Only the most important curves are presented below, to show the relation of the components having the highest eigenvalues. Earlier movement of the lips and the mouth had been observed in cases of coarticulation and at the beginning of words; this delay had been considered a specific and negligible effect, and the delay value had only been estimated. Our new experiments produced a general rule with well-defined delay values. Some of the strongest relations between audio and video features are not in the synchronous time frames: the mouth starts to form the articulation in some cases 100 ms earlier, and the audio parameters follow it with such a delay.
The curves of the mutual information values are asymmetric and shifted towards positive time shift (delay in sound). This means the acoustic speech signal is a better prediction basis for calculating the previous face and lip positions than the future positions. This fact is in harmony with the practical observation mentioned above that the articulation movement precedes speech production at the beginning of words. The excitation signal comes
Figure 4.3: Mutual information of the shifted 2nd FacePCA and MFCPCA components. Positive ∆t means future voice. Darkness shows importance.