Audio to visual speech conversion

Gergely Feldhoffer

A thesis submitted for the degree of Doctor of Philosophy

Scientific adviser: György Takács

Faculty of Information Technology
Interdisciplinary Technical Sciences Doctoral School
Pázmány Péter Catholic University

Budapest, 2010
Contents

1 Introduction ... 3
  1.1 Definitions ... 3
    1.1.1 Components ... 4
    1.1.2 Quality issues ... 5
  1.2 Applications ... 6
    1.2.1 Synface ... 6
    1.2.2 Synthesis of nonverbal components of visual speech ... 7
    1.2.3 Expressive visual speech ... 7
    1.2.4 Speech recognition today ... 7
    1.2.5 MPEG-4 ... 8
    1.2.6 Face rendering ... 8
  1.3 Open questions ... 9
  1.4 The proposed approach of the thesis ... 9
  1.5 Related disciplines ... 10
    1.5.1 Speech inversion ... 10
    1.5.2 Computer graphics ... 10
    1.5.3 Phonetics ... 10

2 Motivation and the base system ... 11
  2.1 SINOSZ project ... 11
    2.1.1 A practical view ... 11
  2.2 Lucia ... 12
  2.3 The base system ... 12
    2.3.1 Database building from video data ... 12
    2.3.2 Audio ... 12
    2.3.3 Video ... 14
    2.3.4 Training ... 14
    2.3.5 First results ... 16
    2.3.6 Discussion ... 16
  2.4 Johnnie Talker ... 17
  2.5 Extending direct conversion ... 17
    2.5.1 Direct ATVS and co-articulation ... 17
    2.5.2 Evaluation ... 19

3 Naturalness of direct conversion ... 21
  3.1 Method ... 21
    3.1.1 Introduction ... 21
    3.1.2 Audio-to-visual conversion ... 21
    3.1.3 Evaluation ... 25
    3.1.4 Results ... 26
    3.1.5 Conclusion ... 29
    3.1.6 Technical details ... 30
  3.2 Thesis ... 32
    3.2.1 Novelty ... 32
    3.2.2 Measurements ... 32
    3.2.3 Limits of validity ... 32
    3.2.4 Consequences ... 32

4 Temporal asymmetry ... 35
  4.1 Method ... 35
    4.1.1 Introduction ... 35
    4.1.2 Results and conclusions ... 40
    4.1.3 Multichannel Mutual Information estimation ... 44
    4.1.4 Duration of asymmetry ... 46
  4.2 Thesis ... 47
    4.2.1 Novelty ... 47
    4.2.2 Measurements ... 47
    4.2.3 Limits of validity ... 48
    4.2.4 Consequences ... 48

5 Speaker independence in direct conversion ... 51
  5.1 Method ... 51
    5.1.1 Introduction ... 51
    5.1.2 Speaker independence ... 53
    5.1.3 Conclusion ... 55
  5.2 Thesis ... 56
    5.2.1 Novelty ... 56
    5.2.2 Measurements ... 56
    5.2.3 Limits of validity ... 56
    5.2.4 Consequences ... 56

6 Visual speech in audio transmitting telepresence applications ... 57
  6.1 Method ... 57
    6.1.1 Introduction ... 58
    6.1.2 Overview ... 58
    6.1.3 Face model ... 59
    6.1.4 Viseme based decomposition ... 60
    6.1.5 Voice representation ... 63
    6.1.6 Speex coding ... 63
    6.1.7 Neural network training ... 64
    6.1.8 Implementation issues ... 65
    6.1.9 Results ... 65
    6.1.10 Conclusion ... 68
  6.2 Thesis ... 68
    6.2.1 Novelty ... 70
    6.2.2 Measurements ... 70
    6.2.3 Consequences ... 70
Acknowledgements

I would like to thank my supervisor György Takács for his help, and my current and former colleagues Attila Tihanyi, Tamás Bárdi, Tamás Harczos, Bálint Srancsik, and Balázs Oroszi. I am thankful to my doctoral school for providing the tools and the caring environment for my work, personally especially to Judit Nyéky-Gaizler and Tamás Roska.

I am also thankful to the students Iván Hegedűs, Gergely Jung, János Víg, Máté Tóth, Gábor Dániel "Szasza" Szabó, Balázs Bányai, László Mészáros, Szilvia Kovács, Solt Bucsi Szabó, Attila Krebsz and Márton Selmeci, who participated in our research group.

My work would have been less without the discussions with visual speech synthesis experts such as László Czap, Takaaki Kuratate and Sascha Fagel.

I would also like to thank my fellow PhD students and friends, especially Béla Weiss, Gergely Soós, Ádám Rák, Zoltán Fodróczi, Gaurav Gandhi, György Cserey, Róbert Wágner, Csaba Benedek, Barnabás Hegyi, Éva Bankó, Kristóf Iván, Gábor Pohl, Bálint Sass, Márton Miháltz, Ferenc Lombai, Norbert Bérci, Ákos Tar, József Veres, András Kiss, Dávid Tisza, Péter Vizi, Balázs Varga, László Füredi, Bence Bálint, László Laki, László Lővei and József Mihalicza for their valuable comments and discussions.

I thank Mrs Vida, Lívia Adorján, Mrs Haraszti, Mrs Körmendy, Gabriella Rumi, Mrs Tihanyi and Mrs Mikesy for their endless patience and helpfulness. I also thank the technical staff of the university for their support, especially Péter Tholt, Tamás Csillag and Tamás Rec.

And last but not least I would like to thank the patient and loving support of my wife Bernadett and my family.
Abstract

In this thesis, I propose new results in audio speech based visual speech synthesis, which can be used as an aid for hard of hearing people or in computer aided animation. I will describe a synthesis tool which is based on direct conversion between the audio and video modalities. I will discuss the properties of this system, measuring the speech quality, and give solutions for its drawbacks. I will show that using an adequate training strategy is critical for direct conversion. At the end I conclude that direct conversion can perform as well as other popular audio to visual speech conversions, and that it is currently ignored undeservedly because of the lack of efficient training.
Chapter 1

Introduction

Audio to visual speech conversion is an increasingly popular applied research field today. Main conferences such as Interspeech or EURASIP events have started new sections concerning multimodal speech processing; Interspeech 2008 held a special session devoted to audio to visual speech conversion.

Possible applications of the field are communication aiding tools for deaf and hard of hearing people [1], taking advantage of the sophisticated lip-reading capabilities of these people, and lip-sync applications in the animation industry, in computer aided animation as well as in real-time telepresence based video games. In this thesis I will describe solutions for both of these applications.

In this chapter I will present the current status of the topic, the motivations, and the state of the art techniques. To understand this chapter, basic speech and signal processing knowledge is needed.
1.1 Definitions

Speech is a multimodal process. The modalities can be classified as audio speech and visual speech. I will use the following terms:

Visual speech is a representation of the view of a talking face.

Visual speech data is the motion information of the visible speech organs in any representation.

Phoneme is the basic meaning-distinctive segmental unit of audio speech. It is language dependent.

Viseme is the basic meaning-distinctive segmental unit of visual speech. It is also language dependent. There are visemes belonging to phonemes, and there are phonemes which have no particular viseme, because the phoneme can be pronounced with more than one way of articulation.

Automatic speech recognition (ASR) is a system or a method which can extract phonetic information from the audio speech signal. Usually a phoneme string is produced.

Audio to visual speech (ATVS) conversion systems create an animation of a face according to a given audio speech.
Figure 1.1: Task of audio to visual speech (ATVS) conversion.
Direct ATVS is an ATVS which maps the audio representation to the video representation by approximation.

Discrete ATVS is an ATVS which uses classification into discrete categories in order to connect the modalities. Usually phonemes and visemes are used.

Modular ATVS is an ATVS which contains an ASR subsystem and a phoneme-viseme mapping subsystem. Modular ATVS systems are usually discrete.

AV mapping is an input-output method where the input is audio data in any representation and the output is visual data in any representation. In a discrete ATVS, the AV mapping is a phoneme-viseme mapping; in a direct ATVS it is an approximator.

1.1.1 Components
Each ATVS consists of an audio preprocessor, the AV (audio to video) mapping, and a face synthesizer. The most straightforward method is the jaw opening driven by speech energy; this system is widely used in on-line games. In this case the audio preprocessor is a frame-by-frame energy calculation expressed in dB, and the AV mapping is a linear function which maps the one dimensional audio data to the one dimensional video parameter, the jaw opening. The face model is usually a vertex array of the face, and the face synthesis is done by modifying the vertices of the jaw. Below, more sophisticated cases will be detailed where naturalness and intelligibility are issues.
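As a minimal sketch of this energy-driven baseline (the function names and the dB thresholds are illustrative choices, not taken from any particular game engine):

```python
import numpy as np

def frame_energy_db(frame, eps=1e-10):
    """Mean-square energy of one audio frame, expressed in dB."""
    return 10.0 * np.log10(np.mean(frame.astype(float) ** 2) + eps)

def jaw_opening(energy_db, silence_db=-60.0, loud_db=-10.0):
    """Linear AV mapping: clamp the frame energy between an assumed
    silence level and an assumed loud level, and scale it to a
    one dimensional jaw-opening parameter in [0, 1]."""
    t = (energy_db - silence_db) / (loud_db - silence_db)
    return float(np.clip(t, 0.0, 1.0))
```

Per frame, the resulting scalar would directly displace the jaw vertices of the face model.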
Recent research activities focus on speech signal processing methods specially for lip-readable face animation [2], face representation and controller methods [3], and convincingly natural facial animation systems [4].
Audio preprocessing

These systems use feature extraction methods to get useful and compact information from the speech signal. The most important aspects of quality here are the dimensionality of the extracted representation and the covering error. For example, the spectrum can be approximated by a few channels of mel bands, replacing the speech spectrum with a certain error. In this case the dimensionality is reduced greatly by allowing a certain noise in the represented data. Databases for neural networks have to consider dimensionality as a primary aspect.

Audio preprocessing methods can be clustered in many aspects, as time domain or frequency domain feature extractors, approximation or classification, etc. A deeper analysis of audio preprocessing methods concerning audiovisual speech is published in [5], with the result that the main approaches perform approximately equally well. These traditional approaches are the Mel Frequency Cepstral Coefficients (MFCC) and Linear Prediction Coding (LPC) based methods. A quite convenient property of LPC based vocal tract estimation is the direct connection to the speech organs via the pipe excitation model. It seems to be a good idea to use the vocal tract for ATVS as well, but according to [5] it does not carry significantly more usable data.
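To make the LPC option concrete, a textbook Levinson-Durbin recursion (a sketch of the standard algorithm, not the preprocessing code used in the thesis) estimates the all-pole vocal tract filter of a frame from its autocorrelation:

```python
import numpy as np

def lpc(frame, order=10):
    """Estimate LPC coefficients a[0..order] (a[0] = 1) of an all-pole
    vocal tract model via the Levinson-Durbin recursion.
    Returns the coefficients and the residual (excitation) energy."""
    n = len(frame)
    # autocorrelation r[0..order]
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient from the current prediction error
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err
        a[1:i] = a[1:i] + k * a[1:i][::-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err
```

The coefficients relate to the pipe excitation model mentioned above: the reflection coefficients correspond to area ratios of a lossless tube.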
AV mapping

In this step the modalities are connected: visual speech data is produced from audio data.

There are different strategies for performing this audio to visual conversion. One approach is to exploit automatic speech recognition (ASR) to extract phonetic information from the acoustic signal. This is then used in conjunction with a set of coarticulation rules to interpolate a visemic representation of the phonemes [6, 7]. Alternatively, a second approach is to extract features from the acoustic signal and convert directly from these features to visual speech [8, 9].
Face synthesis

In this step the visual speech representation is applied to a face model. It is usually separated into two independent parts: facial animation representation and model rendering. The face representation maps the quantitative data on face descriptors. An example of this is the MPEG-4 standard. Face rendering is a method to produce a picture or an animation from face descriptions. These are usually computer graphics related techniques.

1.1.2 Quality issues

An ATVS can be evaluated on different aspects:
- naturalness: how similar the result of the ATVS is to a real person's visual speech

- intelligibility: how much the result helps the lip-reader to understand the content of the speech

- complexity: the system's overall time and space complexity

- trainability: how easy it is to enhance the system's other qualities by examples, whether this process is fast or slow, and whether it is adaptable or fixed

- speaker dependency: how the system performance varies between different speakers

- context dependency: how the system performance varies between speech contents (e.g. a system which was trained on medical content may perform poorer on financial content)

- language dependency: how complex it is to port the system to a different language. The replacement of the database can be enough, or rules may have to be changed, or the very possibility can be questionable.

- acoustical robustness: how the system performance varies in different acoustic environments, such as higher noise.
1.2 Applications

In this section I describe some of the recent systems, with a short characterization in the quality aspects detailed above.
1.2.1 Synface

An example of ATVS systems is the Synface [1] of KTH, Sweden. This system is designed for hearing impaired but not deaf people to handle voice calls on the phone. The system connects the phone line to a computer, where a speech recognition software translates the incoming speech signal to a time aligned phoneme sequence. This phoneme sequence is the basis of the animation control. Each phoneme is assigned to a viseme, and the recognized sequence makes a string of visemes to animate. The speech recognition subsystem not only recognizes the phonemes but performs the segmentation as well. The viseme sequence timed by this segmentation information gives the final result of the AV mapping, using a rule-based strategy. The rule set is created from examples of Swedish multimodal speech.

This system is definitely language dependent: it uses the Swedish phoneme set, a Swedish ASR, and a rule set built on Swedish examples. On the other hand the system performs very well in the aspects of intelligibility, acoustical robustness, and speaker and context dependency.
1.2.2 Synthesis of nonverbal components of visual speech

An example of audio to visual non-verbal speech estimation is the system of Gregor Hofer and Hiroshi Shimodaira [9]. Their system targets the extraction of the correct time of blinks in speech. The audio preprocessing in this system concentrates on non-verbal components, such as rhythm and intonation. Compared to actual videos, the original audio signal was used to test the precision of the estimation, which was above 80% with a decent time toleration of 100 ms. It is important that there are two kinds of blink: the fast blink of regular eye care, and the emphasized blink which is a non-verbal visual speech component. Of course this work focused on the second variant.
1.2.3 Expressive visual speech

This field changed its name from "emotional speech" to "expressive speech" for psychological reasons. Expressive speech research targets the synthesis or recognition of emotional expressions in speech. Expressing emotions is very relevant in visual speech.

I show two approaches to the field. Pietro Cosi et al work on the virtual head "Lucia" [10] to connect an expressive audio speech synthesizer with a visual speech synthesizer. This text based system can be used as an audiovisual agent on any interactive medium where text can be used. For expressive visual speech it uses visemes for the textual content and four basic emotional states of the face as an expressive speech basis. They work on a natural blending function of these states.

Sascha Fagel works on expressive speech in a broad sense [11]. He created a method to help creating expressive audiovisual databases by leading the subject through emotional stages to reach the desired level of expression gradually. This way it is possible to record emotionally neutral content (e.g. "It was on Friday") articulated with joy or anger. The trick is to record the sentence multiple times and insert emotionally relevant content between the occurrences. One example could be the sequence "Trouble always happens with me! It was on Friday. What do you think you are?! It was on Friday. I hate you! It was on Friday." This method gives the speaker a guide to express anger which gradually increases in expressiveness. The database will contain only the occurrences of the emotionally neutral content.
1.2.4 Speech recognition today

As of 2010, after a decade, the hegemony of Hidden Markov Model (HMM) based ASR systems [12] is still standing. This approach uses a language model formulated with consecutiveness functions, and a pronunciation model with confusion functions.

The main reason for the popularity of HMM based ASR systems is the efficiency of handling many thousands of words in a formal grammar. This grammar can be used to focus the vocabulary around a specific topic to increase the correctness and reduce the time complexity. An HMM can be trained to a specific speaker, but it can also be trained on large databases to work speaker independently.
Figure 1.2: Mouth focused subset of feature points of MPEG-4.
1.2.5 MPEG-4

MPEG-4 is a standard for face description for communication. It uses Feature Points (FP) and Facial Animation Parameters (FAP) to describe the state of the face. The constant properties of a face can also be expressed in MPEG-4, for example the sizes in FAPU (Facial Animation Parameter Units).

The usage of MPEG-4 is typical in multimedia applications where an interactive or highly compressed pre-recorded behavior of a face is needed, such as video games or news agents. One of the most popular MPEG-4 systems is Facegen [13].

For visual speech synthesis MPEG-4 is a fair choice, since there are plenty of implementations and resources. The degree of freedom around the mouth is close to the actual needs, but there are features which cannot be modeled with MPEG-4, such as inflation. Gerard Bailly et al showed that using more feature points around the mouth can increase naturalness significantly [14].
1.2.6 Face rendering

The task of synthesizing the picture from face descriptors is face rendering. Usually 3D engines are used, but 2D systems can also be found. The spectrum of approaches and image qualities is very wide, from efficient simple implementations to muscle based simulations [4].

Most of the face renderers use 3D acceleration and vertex arrays to interpolate, which is a well accelerated operation on today's video cards. In this case a few given vertex arrays represent given phases of the face, and using interpolation techniques, the state of the face can be expressed as a weighted sum of the proper vertex arrays. The resulting state can be textured and lighted just as one of the originally designed facial phases.
1.3 Open questions

It is clear that face modeling and facial animation (subtasks of audio to visual speech conversion) are still evolving, but these are mainly development fields. There are, however, open research areas, such as the approximation of the human skin's physical properties, the connection of the modalities, the evaluation of AV mapping, what is speaker dependent in the articulation, and what the minimal necessary degrees of freedom are for perfect facial modeling.

My motivations cover the applied research on the connection between the modalities. This is also an open question. There are convenient arguments for the physical relation between the modalities: the speech organs are used both for audio and visual speech, although some of them are not visible. There must be physical effects of the visible speech organs on the audio speech.

On the other hand, there are phenomena where the connection between the modalities is minimal. Speech disorders can affect the audio speech without a visible trace. Ventriloquism (the art of speaking without lip movement, usually performed with puppets creating the illusion of a speaking puppet) is also an interesting exception.

To avoid inconsistency I turned to the clarified topic of audio to visual speech conversion.
1.4 The proposed approach of the thesis

The physical connection between the modalities can give a guideline to reach basic conversion from audio to video, but this goal is not clear without specified quality aspects. The next chapter will detail how our research group met the field through aiding deaf and hard of hearing people. Their main quality aspects of the resulting visual speech are lip-readability and naturalness. This way the problem can be redefined as searching for the most appropriate visual speech for the given audio speech signal, not restoring the original visual articulation.

The physical connection can be utilized easily through direct conversion. Direct ATVS systems are not speech recognition systems; the target is to produce an animation without recognizing any of the language layers such as phonemes or words, as this part of the process is left to the lip-reader. Because of this, our ATVS uses no phoneme recognition; furthermore there is no classification part in the process. This is the direct ATVS, avoiding any discrete type of data in the process. Discrete ATVS systems use visemes as the visual match of phonemes to describe a given state of the animation of a phoneme, and use interpolation between them to produce coarticulation.

One of the most important benefits of direct conversion is the chance to conserve the nonverbal content of the speech, such as prosody, dynamics and rhythm. Modular ATVS systems have to synthesize these features to maintain the naturalness of the result.
1.5 Related disciplines

1.5.1 Speech inversion

Our task is similar to speech inversion, which tends to extract information from the speech signal about the state sequence of the speech organs. However, speech inversion aims to reproduce every speech organ in exactly the same state as the speaker used his organs, with every speaker dependent property [15, 16]. ATVS is different: the target is to produce a lip-readable animation which depends only on the visible speech organs and does not depend on the speaker dependent features of the speech signal.

Speech inversion aims to recover the state sequence of the speech organs from speech. A very simple model and solution of this problem is the vocal tract. Recent research in this field concerns tongue motion and models in particular.

1.5.2 Computer graphics

Synthesis of the human face is a challenging field of computer graphics. The main reason for the high difficulty is the very sensitive human observer. Humankind developed a highly sophisticated communication system with facial expressions; it is a basic human skill to identify emotional and contextual content from a face. An example of cutting edge face synthesis systems is the rendering system of the movie Avatar (2009), where the system parameters were extracted from actors [17]. There are recent scientific results on efficient volume conserving deformations of facial skin based on muscular modeling [4]. These modern rendering methods can reproduce the creasing of the face, which is perceptually important.

1.5.3 Phonetics

The science of phonetics is related to ATVS systems through the ASR based approaches. Phonetically interesting areas are the ASR component, the phoneme string processing, and the rules applied on phoneme strings to synthesize visual speech, such as interpolation rules or dominance rules.

The details of articulation, and the relation of the phonetic content and the facial muscle controls, is the topic of articulatory phonetics [18, 19]. This field classifies phonemes by their places of articulation: labial-dental, coronal, dorsal, glottal. ATVS systems are aware of the visible speech organs, so labial-dental consonants are important, along with vowels and articulations with open mouth. For example the phoneme "l" is identifiable by its alveolar articulation, since it is articulated with opened mouth.

Articulatory phonetics has important results for ATVS systems, as we will see in the details of visual speech synthesis from phoneme strings.
Chapter 2

Motivation and the base system

In this chapter I will describe the main tasks I had to deal with, showing the motivation of my thesis. I will describe a base system as well. The base system itself is not part of the contribution of my thesis, although understanding the base system is important to understanding my motivations.
2.1 SINOSZ project

The original project with SINOSZ (the Hungarian National Association of the Deaf and Hearing Impaired) aimed at a mobile system to help hard of hearing people deal with audio-only information sources. The first idea was to visualize the audio data in some learnable representation, but the association rejected any visualization technique which must be learned by the deaf community, so the visualization method had to be some already known representation of the speech. We had basically two choices: to implement an ASR to get text, or to translate to facial motion. We expected more efficient and robust quality from facial motion conversion with the capabilities of a mobile device in 2004.

The development of the mobile application was initiated with the project. The mobile branch of the project is out of the scope of my thesis, although the requirement of efficiency is important.
2.1.1 A practical view

When I started to work on audio to visual speech conversion, after examining some of the systems in the aspects of requirements and qualities detailed in the previous chapter, I decided to use direct conversion. The main reason at that time was to get a functional and efficient test system as soon as possible, to have results and first hand experience, with the hope of a sufficiently efficient implementation later.

Direct conversion can be deployed on mobile platforms more easily than database dependent classifier systems. Not only is the computational time moderate, but the memory requirements are also lower. Choosing direct conversion was the option with the guaranteed possibility of a test implementation on the target platform.
2.2 Lucia

In the beginning of the project I convinced the team to use direct mapping between the modalities. My two important reasons were the efficiency and the lack of the requirement of a labeled database, unlike for an ASR. Since we did not have any audiovisual databases (and even in 2009 there are quite few publicly available) we had to think about not only the system but the database also. Direct conversion does not need labeled data, so manual work can be minimized, which shortens the production time.

So the planned system contained a simple audio preprocessing (LPC or MFCC), a direct mapping from audio feature vectors to video via code-book or neural network, and visualization of the result on an artificial head. We did not have any face models, neither did we want to create one, so we were looking for an available head model.

The first test system used the talking head of Cosi et al [10] called Lucia. The head model was originally used for expressive speech synthesis. The system used MPEG-4 FAP as input and generated a run-time video in an OpenGL window; exporting to video files was also available.
2.3 The base system

The base system is an implementation of direct conversion (see Figure 2.1).

2.3.1 Database building from video data

Direct conversion needs pairs of audio and video data, so the database should be a (maybe labeled) audiovisual speech recording where the visual information is enough to synthesize a head model. Therefore we recorded a face with markers on a subset of the MPEG-4 FP positions, mostly around the mouth and jaw, and also some reference points. Basically this is a preprocessed multimedia material, specially prepared to be used as a training set for neural networks. For this purpose the data should not contain strong redundancy, to allow a practically acceptable learning speed, so the pre-processing includes the choice of an appropriate representation as well. With an inadequate representation the learning may take months, or may not even converge.
2.3.2 Audio

The voice signal is processed at a 25 frame/s rate to be in synchrony with the processed video signal. One analysis window is 20-40 ms; the number of samples in the 40 ms window is chosen to be at most 2^n samples. The input speech can be pre-emphasis filtered with H(z) = 1 - 0.983 z^-1. A Hamming window and FFT with the Radix-2 algorithm are applied. The FFT spectrum is converted to 16 mel-scale bands, and logarithm and DCT are applied. Such Mel Frequency Cepstrum Coefficient (MFCC) feature vectors are commonly used in general speech recognition tasks. The MFCC feature vectors provide the input to the neural networks after scaling to [-0.9..0.9].
Figure 2.1: Workflow used in Lucia.
2.3.3 Video

For video processing we used two methods. Both methods are based on video recordings of a speaker and feature tracker applications. The first method is based on markers only, which are placed around the mouth. The markers were selected as a subset of the MPEG-4 face description standard. Tracking the markers is a computer aided process; a 98% precise marker tracker algorithm was developed for this phase. The mistakes were corrected manually. The marker positions as a function of time were the raw data, which was normalized by control points such as the nose to eliminate the motion of the whole head. This gives a 30-36 dimensional space depending on the marker count. This data is very redundant and high dimensional, and not suitable for neural network training, so PCA was applied to reduce the dimensionality and eliminate the redundancy. The PCA can be treated as a lossy compression, because only the first 6 parameters were used for training. Using only 6 coefficients can cause about 1 pixel error on a PAL screen, which is the precision of the marker tracking. The first 4 components can be seen in Fig 2.2.
The base system's video database is a set of video records of professional lip-speakers. Their moving faces are described by a 15 element subset of the standard MPEG-4 feature point (FP) set (84 points). These feature points were marked by colored dots on the faces of the speakers. The coordinates of the feature points were calculated by a marker tracking algorithm.
The marker tracking algorithm used the number of markers (nm) as input, and on each frame it looked for the nm most marker-like areas of the picture. The marker-likeliness was given as a high energy fixed size blob after yellow filtering. The tracking contained a self-check: by looking for additional markers and comparing the marker-likelinesses of the markers [.., nm-1, nm, nm+1, ..], a good tracking shows a strong decrease after the nm-th marker. If the decrease is before nm there are missing markers; if the decrease is after nm there are misleading blobs in the frame. Using this self-check of the tracking, manual corrections were made.
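The decision part of this self-check can be sketched as follows (the blob scoring itself is omitted; names and the "largest drop" criterion are my reading of the description above):

```python
def check_markers(scores, nm):
    """Self-check of the marker tracker.  `scores` are marker-likeliness
    values of the nm expected markers plus a few extra candidates,
    sorted in decreasing order.  A good frame shows its strongest drop
    right after the nm-th score; an earlier drop suggests missing
    markers, a later one suggests misleading blobs."""
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    cut = max(range(len(drops)), key=drops.__getitem__) + 1  # candidates kept before the drop
    if cut == nm:
        return "ok"
    return "missing markers" if cut < nm else "misleading blobs"
```

Frames flagged as anything but "ok" would be queued for the manual correction pass.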
The FP coordinates mean 30 dimensional vectors, which are compressed by PCA. We have realized that the first few PCA basis vectors have close relations to the basic movement components of the lips. Such components can differentiate visemes. The marker coordinates are transformed into this basis, and we can use the transformation weights as data (FacePCA). The FacePCA vectors are the target output values of the neural net during the training phase [8].
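A minimal sketch of this FacePCA step (SVD-based PCA; the thesis does not specify the implementation, and the function names are mine):

```python
import numpy as np

def face_pca(coords, n_keep=6):
    """Compress normalized marker coordinates (n_frames x 30) with PCA.
    Returns the mean pose, the first n_keep basis vectors, and the
    per-frame weights used as neural-network training targets."""
    mean = coords.mean(axis=0)
    centered = coords - mean
    # rows of vt are the principal directions, ordered by variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_keep]
    weights = centered @ basis.T        # the FacePCA vectors
    return mean, basis, weights

def reconstruct(mean, basis, weights):
    """Map FacePCA weights back to marker coordinates for animation."""
    return mean + weights @ basis
```

At synthesis time the network predicts the 6 weights and `reconstruct` turns them back into feature point positions.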
2.3.4 Training

The synchrony of the audio and video data is checked by the word "papapa" at the beginning and the end of the recording. The first opening of the mouth by this bilabial can be synchronized with the burst in the audio data. This synchronization guarantees that the pairs of audio and video data were recorded at the same time. For the best result the neural network has to be trained on multiple windows of audio feature vectors, where the window count has to be chosen based on the optimal temporal scope.
Figure 2.2: Principal components of MPEG-4 feature points in the database of a professional lip-speaker.
The neural network is a back-propagation implementation by Davide Anguita called Matrix Back Propagation [20]. This is a very efficient software; we use a slightly modified version of the system to be able to continue a training session.
2.3.5 First results

The described modules were implemented and trained. The system was measured with a recognition test with deaf people. To simulate a measurable communication situation, the test covered numbers and the names of the days of the week and the months. As the measurement aimed to tell the difference between the ATVS and a real person's video, the situation had to take average lip-reading cases into consideration. As we found in [8], deaf persons rely upon context more than hearing people. In the cases of numbers or names of months the context clearly defines the class of the word but leaves the actual value uncertain. During the test the subjects had to recognize 70 words from video clips. One third of the clips were original video clips from the recording of the database, another third were output of the ATVS from audio signals, and the remaining third were synthesized video clips from the extracted video data. The difference between the recognition of the real recording and the face animation from the extracted video data gives the recognition error from the face model and the database, while the difference between animations from video data and from audio data gives the quality of the audio to video conversion. Table 2.1 shows the results.
Table 2.1: Recognition rates of different video clips.

  Material                    Recognition rate
  original video              97%
  face model on video data    55%
  face model on audio data    48%
2.3.6 Discussion

In this case the 48% should be compared to the 55%. The face model driven by recorded visual speech data is the best possible behavior of the direct ATVS system. The ratio of the results is 87%. It means that the base system lets deaf people understand 87% of what the best possible system would give. This was a very encouraging experience.

The 55% is a weak result compared to the 97% of the original video. This falloff is because of the artificial head. This ratio could be enhanced by using more sophisticated head models and facial parameters, but that direction of research and development is out of the scope of this thesis.
2.4 Johnnie Talker

Johnnie Talker is a real-time system with very low time complexity, a pure AV mapping application with a simple face model and facial animation. Johnnie was implemented to demonstrate the low time complexity of the direct ATVS approach. The application is a development of our research group; it uses my implementation of the AV mapping, Tamás Bárdi's audio preprocessing and Bálint Srancsik's OpenGL based head model.

Johnnie is freely downloadable from the webpage of the author of this dissertation [21]. It is a Windows application using OpenGL.

Because of the demand for low latency, no theoretical delay was used in this system. In the next few chapters I will describe how the naturalness and the intelligibility can be enhanced by using a time window into the future of the audio modality. This can be implemented by delaying the audio dub to maintain audio-video synchrony while using the future audio at the same time. For example a phone line can be delayed by the theoretically optimal time window. But since Johnnie can be used via microphone, which cannot be delayed, any additional buffering would cause noticeable latency which ruins the subjective sense of quality. As one of the next chapters will describe, subjective quality evaluation depends heavily on audio-video synchrony, and this phenomenon appears strongly in the perception of a synthesized visual speech of one's own speech in real time.

Johnnie Talker was shown at various international conferences with success. It was a good opportunity to test language independence in practice.

We were looking for techniques to improve the qualities of the real-time system without additional run-time overhead. There will be a chapter about a method which can enhance the speaker independence of the system using only database modifications, so no run-time penalty is needed.
2.5 Extending direct conversion

2.5.1 Direct ATVS and co-articulation

The most common form of language is personal talk, which is an audiovisual speech process. Our research is focused on the relation of the audio and the visual part of talking, to build a system converting the voice signal into face animation.

Co-articulation is the phenomenon of transient phases in the speech process. In the audio modality, co-articulation is the effect of the neighboring phonemes on the actual state of speech in a short window of time, shorter than a phoneme duration. In speech synthesis, there is a strong demand to create natural transients between the clean states of speech. In visual speech synthesis this issue is also important. In the visual speech process there are visemes even if the synthesizer does not explicitly use this concept. Visual co-articulation can be defined as a system of influences between visemes in time. Because of biological limitations, visual co-articulation is slower than audio co-articulation, but similar in other ways: neighboring visemes can have an effect on each other, there are stronger visemes than others, and most of the cases can be described or approximated as an interpolation of neighboring visemes.

Let me call a system a visual speech transient model if it generates intermediate states of visual speech units, such as visemes. An example of a visual speech transient model is the co-articulation concept strictly adopted on visemes, the visual co-articulation, since the viseme string processing has to decide how interpolation should take place between the visemes. Another example of visual speech transient models is the direct conversion's adaptation to longer time windows in order to include more than one phoneme of the audio modality. In this case the transients depend on acoustical properties. In modular ATVS systems, the transients are coded in rules depending on viseme string neighborhoods.
Utilization
Training a direct ATVS needs audio-video data pairs. Since plenty of speech audio databases exist but only a few audiovisual ones, building a direct ATVS means building a multimodal database first. A discrete ATVS is a modular system; it is possible to use existing speech databases to train voice recognition, and separately train the animation part on phoneme pairs or trigraphs [1]. A direct ATVS therefore needs a special database, but the system will handle energy and rhythm naturally, while a discrete ATVS has to reassemble the phonemes into a fluid co-articulation chain of viseme interpolations. Let us use the term "temporal scope" for the overall time span of a co-articulation phenomenon, meaning that the state of the mouth depends on this time interval of the speech signal. In a direct ATVS the calculation of a frame is based on this audio signal interval. In a discrete ATVS the visemes and the phonemes are synchronized and interpolation is applied between them, as is popular in text-to-visual-speech systems [22]. Figure 2.3 shows this difference.
Figure 2.3: Temporal scope of discrete (interpolating) and direct ATVS
Asymmetry
As the mutual information estimation showed, any given state of the video data stream can be calculated fairly well from a definable relative time window of the speech signal. This model predicts that the transient phase of visible speech can be calculated in the same way as the steady phase, as Figure 2.3 shows.
This model gives a prediction about temporal asymmetries in the multimodal speech process. This asymmetry can be explained by mental predictivity in the motion of the facial muscles to fluently form the next phoneme. Details will follow in the chapter "Temporal asymmetry".

Speaker independence
Since the direct conversion is usually an approximation trained on a given set of audio and video states, it suffers from heavy dependence on the database. As I detailed before, a good direct ATVS needs a good lip-speaker to share visual data with the system. Talented lip-speakers are rare, and most of the experienced lip-speakers are women. This means that a single recording of one lip-speaker gives a speaker-dependent system, and collecting more professional lip-speakers would result in a gender-dependent system, since the statistics of the data would be heavily biased: it is very difficult to collect enough male lip-speakers for the system.
Even if we had plenty of professional lip-speakers, there is a question about mixing the video data. People articulate differently. It is not guaranteed that a mixture of good articulations results in even an acceptable articulation. The safest solution is to choose one of the lip-speakers as a guaranteed high-quality articulation, and to try to use his or her performance with multiple voices.
I will give a solution for this problem in the chapter "Speaker independence in direct conversion".
2.5.2 Evaluation
The base system was published as a standalone system, and was measured with subjective opinion scores and intelligibility tests with deaf persons. In my chapter about the comparison of AV mappings, I will position the direct ATVS among the others used in the world.
Oddly, there are quite few publications on direct ATVS. This is strange, because the system is one of the simplest designs. Let me tell a personal experience from a EUSIPCO conference in Florence. A young researcher was interested in our Johnnie demo. He was from ATR, Japan, and he praised our system. As I explained the workflow of the system, at each stage he said "We did the same". Even the number of PCA coefficients was the same. At the end, he said that their system produced significantly worse results than ours; it was not even published because it was flawed. We agreed then that the most important difference is the lip-speaker's professionality.
His work can be read in Japanese [23] in the annual report of the institute; incidentally, in the same year we published our results in Hungarian [24, 25, 26].
Another example of direct conversion publicity is a comparison study in which an unpublished direct system [27] is used as an internal baseline.
Because of this scarcity of publications, it is important to position direct ATVS among the more popular modular ATVS systems, since most visual speech synthesis research groups also try to implement a direct ATVS, but their efforts fail because of the quality of the database. At first glance this may seem like bad news for our research, but the novelty of our system is unharmed, since the work at ATR was identical only in the technical details, and a training system's technology by itself, without the database, is not a whole system. Our base system is new because of the new training data and the finding of the need for a professional lip-speaker. Again, I would like to emphasize that the difference between our base system and the one developed at ATR is not "only" the database but the training strategy, which is one of the most important and fundamental parts of any learning system.
This new and successful learning strategy makes our base system novel, but in this thesis I focus on the results of my own work, not the research group's. Johnnie Talker is a contribution of the group, and the following extensions and measurements are contributions of the author of this thesis.
In the next chapter I will show how the base system, with the essential database of the professional lip-speaker, can be ranked among the widely used ATVS systems.
Chapter 3

Naturalness of direct conversion
In this chapter I discuss the measurement of the naturalness of synthetic visual speech, and the comparison of different AV mapping approaches.
3.1 Method
A comparative study of audio-to-visual speech conversion is described in this chapter. The direct feature-based conversion approach is compared to various indirect ASR-based solutions. The already detailed base system was used as the direct conversion. The ASR-based solutions are the most sophisticated systems currently available for Hungarian. The methods are tested in the same environment in terms of audio pre-processing and facial motion visualization. Subjective opinion scores show that with respect to naturalness, direct conversion performs well. Conversely, with respect to intelligibility, ASR-based systems perform better.
The thesis about the results of the comparison is important because no AV mapping comparisons had been done before with the novel training database of a professional lip-speaker.
3.1.1 Introduction
A difficulty that arises in comparing the different approaches is that they are usually developed and tested independently by the respective research groups. Different metrics are used, e.g. intelligibility tests and/or opinion scores, and different data and viewers are applied [28]. In this chapter I describe a comparative evaluation of different AV mapping approaches within the same workflow, see Figure 3.1. The performance of each is measured in terms of intelligibility, where lip-readability is measured, and naturalness, where a comparison with real visual speech is made.
3.1.2 Audio-to-visual Conversion
The performance of five different approaches will be evaluated. These are summarized as follows:
Figure3.1: Multipleconversion methodsweretestedinthesameenvironment
Figure 3.2: Structure of direct conversion.

- A reference based on natural facial motion.
- A direct conversion system.
- An ASR-based system that linearly interpolates phonemic/visemic targets.
- An informed ASR-based approach that has access to the vocabulary of the test material (IASR).
- An uninformed ASR (UASR) that does not have access to the text vocabulary.

These are described in more detail in the following sections.
Direct conversion
We used our base system, with a database of a professional lip-speaker. The length of the recorded speech was 4250 frames.
ASR-based conversion
For the ASR-based approaches a Weighted Finite State Transducer and Hidden Markov Model (WFST-HMM) decoder is used. Specifically, a system known as VOXerver [29] is used, which can run in one of two modes: informed, which exploits knowledge of the vocabulary of the test data, and uninformed, which does not. Incoming speech is converted to MFCCs, after which blind channel equalization is used to reduce linear distortion in the cepstral domain [30]. Speaker-independent cross-word decision-tree based triphone acoustic models are applied, previously trained using the MRBA Hungarian speech database [31], a standardized, phonetically balanced Hungarian speech database developed at the Budapest University of Technology and Economics.
The uninformed ASR system uses a phoneme-bigram phonotactic model to constrain the decoding process. The phoneme-bigram probabilities were estimated from the MRBA database. In the informed ASR system a zerogram word language model is used with a vocabulary size of 120 words. Word pronunciations were determined automatically as described in [32].
In both types of speech recognition approaches the WFST-HMM recognition network was constructed offline using the AT&T FSM toolkit [33]. In the case of the informed system, phoneme labels were projected to the output of the transducer instead of word labels. The precision of the segmentation is 10 ms.
Viseme interpolation
To compare the direct and indirect audio-to-visual conversion systems, a standard approach for generating visual parameters is to first convert a phoneme to its equivalent viseme via a lookup table, then linearly interpolate the viseme targets. This approach to synthesizing facial motion is oversimplified because coarticulation effects are ignored, but it does provide a baseline on expected performance (worst-case scenario).
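As a concrete illustration, this baseline can be sketched as a lookup table plus piecewise-linear blending between viseme targets placed at segment midpoints. The phoneme set, target values and midpoint placement below are illustrative assumptions, not the actual tables used in the experiments:

```python
import numpy as np

# Hypothetical phoneme -> viseme target table; each target is a small
# vector of facial parameters (here: jaw opening, lip-corner stretch).
VISEME_TARGETS = {
    "p": np.array([0.0, 0.0]),   # bilabial: closed mouth
    "a": np.array([0.8, 0.3]),   # open vowel
    "o": np.array([0.5, -0.2]),  # rounded vowel
}

def linear_viseme_track(segments, frame_ms=40.0):
    """Baseline ATVS: place each viseme target at its segment midpoint
    and linearly interpolate between successive targets, ignoring
    coarticulation. segments: list of (phoneme, start_ms, end_ms)."""
    mids = [(s + e) / 2.0 for _, s, e in segments]
    targets = [VISEME_TARGETS[p] for p, _, _ in segments]
    frames = []
    t = 0.0
    while t <= segments[-1][2]:
        if t <= mids[0]:
            frames.append(targets[0])
        elif t >= mids[-1]:
            frames.append(targets[-1])
        else:
            i = max(j for j in range(len(mids)) if mids[j] <= t)
            w = (t - mids[i]) / (mids[i + 1] - mids[i])
            frames.append((1.0 - w) * targets[i] + w * targets[i + 1])
        t += frame_ms
    return np.stack(frames)

# One 40 ms frame track for a timed phoneme string /p a o/:
track = linear_viseme_track([("p", 0, 80), ("a", 80, 240), ("o", 240, 400)])
```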
Modular ATVS
To account for coarticulation effects, a more sophisticated interpolation scheme is required. In particular, the relative dominance of neighboring speech segments on the articulators is needed. Speech segments can be classified as dominant, uncertain or mixed according to the level of influence exerted on the local neighborhood. To learn the dominance functions, an ellipsoid is fitted to the lips of speakers in a video sequence articulating Hungarian triphones. To aid the fitting, the speakers wear a distinctly colored lipstick. Dominance functions are estimated from the variance of visual data in a given phonetic neighborhood set. The learned dominance functions are used to interpolate between the visual targets derived from the ASR output [34]. We use the implementation of László Czap and János Mátyás here, which produces Poser script. FAPs are extracted from this format by the same workflow as from an original recording.
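The dominance idea can be sketched as a weighted blend in which each segment's influence decays with temporal distance. The exponential dominance shape and all numbers below are illustrative assumptions (the system above estimates dominance from the variance of visual data; exponential kernels of this shape are used in the Cohen-Massaro tradition of coarticulation modeling):

```python
import math

def dominance(dt_ms, strength, decay=0.01):
    # Exponentially decaying dominance of a segment at temporal distance
    # dt_ms from its centre. The shape is a Cohen-Massaro style
    # assumption, not the variance-estimated functions described above.
    return strength * math.exp(-decay * abs(dt_ms))

def blend(t_ms, segments):
    """Dominance-weighted average of viseme targets at time t_ms.
    segments: list of (target_value, centre_ms, strength)."""
    weights = [dominance(t_ms - c, s) for _, c, s in segments]
    return sum(w * v for w, (v, _, _) in zip(weights, segments)) / sum(weights)

# A dominant bilabial closure (strength 3) between two open vowels
# pulls the mouth almost shut at its centre:
segs = [(0.8, 0.0, 1.0), (0.0, 100.0, 3.0), (0.8, 200.0, 1.0)]
mouth_at_closure = blend(100.0, segs)
```

Unlike plain linear interpolation, a strong segment keeps pulling the trajectory toward its target even inside its neighbors' time spans.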
Figure 3.3: Modular ATVS consists of an ASR subsystem and a text-to-visual-speech subsystem.
Rendering Module
The visualization of the output of the ATVS methods is common to all approaches. The output from the ATVS modules is a set of facial animation parameters (FAPs), which are applied to a common head model for all approaches. Note that although better facial descriptors than MPEG-4 are available, MPEG-4 is used here because our motion capture system does not provide more detail than this. The rendered video sequences are created from these FAP sequences using the Avisynth [35] 3D face renderer. As the main components of the framework are common between the different approaches, any differences are due to the differences in the AV mapping methods. Actual frames are shown in Fig. 3.5.
3.1.3 Evaluation
Implementation-specific noncritical behavior (e.g. articulation amplitude) should be normalized to ensure that the comparison is between the essential qualities of the methods. To discover these differences, a preliminary test was done.
Preliminary test
To tune the parameters of the systems, 7 videos were generated by each of the five mapping methods, and some sequences were re-synthesized from the original facial motion data. All sequences started and ended with a closed mouth, and each contained between 2-4 words. The speaker who participated in all of the tests was not one of those who were involved in training the audio-to-visual mapping. The videos were presented in a randomized order to 34 viewers, who were asked to rate the quality of the systems using an opinion score (1-5). The results are shown in Table 3.1.

Table 3.1: Results of preliminary tests used to tune the system parameters. Shown are the average and standard deviation of scores.

Method     Average score   STD
UASR       3.82            0.33
Original   3.79            0.24
Linear     3.17            0.4
Direct     3.02            0.41
IASR       2.85            0.72
The results were unexpected; the IASR, which uses a more sophisticated coarticulation model, was expected to be one of the best performing systems. Closer investigation of the lower scores showed that the reason was the poorer audiovisual synchrony of IASR compared to UASR. The reason for this phenomenon is the difference in the mechanisms of the informed and the uninformed speech recognition processes. During informed recognition the timing information is produced as a consequence of the alignment of the correct phonemes to the signal, which forces the segment boundaries by using the certain phonetic information. The uninformed recognition may miscategorize a phoneme, but the acoustical changes are the drivers of the segment boundaries, so the resulting segmentation is closer to the acoustically reasonable one than the phonetically driven segmentation.
A qualitative difference between the direct and indirect approaches is the degree of mouth opening: the direct approach tended to open the mouth on average 30% more than the indirect approaches. Consequently, to bring the systems into the same dynamic range, the mouth opening for the direct mapping was damped by 30%. The synchrony of the ASR-based approaches was checked for systematic errors (constant or linearly increasing delays) using cross-correlation of locally time-shifted windows, but no systematic patterns of errors were detected.
3.1.4 Results

ASR subsystem
The quality of the ASR-based approach is affected by the recognized phoneme string. This is typically 100% correct for the informed system, as the test set consists only of a small number of words (months of the year, days of the week, and numbers under 100), whilst the uninformed system has a typical error rate of 25.21%. Despite this, the ATVS using this input performs surprisingly well. The likely reason is the pattern of confusions: often phonemes that are confused acoustically appear visually similar on the lips. A second factor that affects the performance of the ASR-based approaches is the precision of the segmentation. Generally the uninformed systems are more precise on the average than the informed systems. The precision of the segmentation can severely impact the subjective opinion scores. We therefore first attempt to quantify these likely sources of error.

Figure 3.4: Trajectory plot (FAP value versus time in ms) of the different methods (Direct, IASR, UASR, Linear) for the word "Hatvanhárom" (hOtvOnha:rom). Jaw opening and lip opening width are shown. Note that the speaker did not pronounce the utterance perfectly, and the informed system attempts to force a match with the correctly recognized word. This leads to time alignment problems.
The informed recognition system is similar in nature to forced alignment in standard ASR tasks. For each utterance the recognizer is run in forced alignment mode for all of the vocabulary entries. The main difference between the informed and the uninformed recognition process is the different Markov state graphs used for recognition. The informed system uses a zerogram without loopback, while the uninformed graph is a bigram model graph where the probabilities of the connections depend on language statistics.
While matching the extracted features with the Markovian states, the differences are accumulated in both scenarios. However, the uninformed system allows for different phonemes outside of the vocabulary to minimize the accumulated error. For the informed system only the most likely sequence is allowed, which can distort the segmentation; see Figure 3.4 for an example where the speaker mispronounces the word "Hatvanhárom" (hOtvOnha:rom, "63" in Hungarian). The (mis)segmentation of OtvO means the IASR ATVS system opens the mouth after the onset of the vowel. Human perception is sensitive to this error, so it severely impacts the perceived quality. Without forcing the vocabulary, a system may ignore one of the consonants but open the mouth at the correct time.
Note that the generalization of this phenomenon is out of the scope of this work. We have demonstrated that this is a problem with certain implementations of HMM-based ASR. Alternative, more robust implementations might alleviate these problems.
Table 3.2: Results of opinion scores, average and standard deviation.

Method                   Average score   STD
Original facial motion   3.73            1.01
Direct conversion        3.58            0.97
UASR                     3.43            1.08
Linear interpolation     2.73            1.12
IASR                     2.67            1.29
Subjective opinion scores
The test setup is similar to the preliminary test described previously to tune the system. However, 58 viewers were used, and only a quantitative opinion survey was made, on a scale of 1 (bad, very artificial) to 5 (real speech).
The results of the opinion score test are shown in Table 3.2. The advantage of direct conversion over UASR is on the edge of significance with p=0.0512, as is the difference between the original speech and the direct conversion with p=0.06, but UASR is significantly worse than original speech with p=0.00029. Compared to the preliminary test, the results also show that with respect to naturalness, the excessive articulation is not significant. The advantage of correct timing over a correct phoneme string is also significant.
Note that the linear interpolation system exploits better quality ASR results, but still performs significantly worse than the average of the other ASR-based approaches. This shows the importance of correctly handling viseme dominance and viseme neighborhood sensitivity in ASR-based ATVS systems.
Intelligibility
Intelligibility was measured with a test of recognition of video sequences without sound. This is not the popular Modified Rhyme Test [36], but for our purposes with hearing-impaired viewers it is more relevant, since keyword spotting is the most common lip-reading task. The 58 test subjects had to guess which word was said from a given set of 5 other words of the same category. The categories were numbers, names of months and the days of the week. All the words were said twice. The sets were intervals, to eliminate the memory test from the task (for example "2", "3", "4", "5", "6" can be a set). This task models the situation of hearing impairment or a very noisy environment where an ATVS system can be used. It is assumed that the context is known, so keyword spotting is the closest task to the problem.
The performance of the audio-to-visual speech conversion methods reverses in this task compared to naturalness. The main result here is the dominance of ASR-based approaches (Table 3.3), and the insignificance of the difference between informed and uninformed ATVS results (p=0.43) in this test, which may deserve further investigation. Note that as synchrony is not an issue without voice, the IASR is the best.
Table 3.3: Results of recognition tests, average and standard deviation of success rate in percent. A random pick would give 20%.

Method              Precision   STD
IASR                61%         20%
UASR                57%         22%
Original motion     53%         18%
Cartoon             44%         11%
Direct conversion   36%         27%
Table 3.4: Comparison to the results of Öhman and Salvi [27], an intelligibility test of HMM- and rule-based systems. Intelligibility of corresponding methods is similar.

Methods        Prec.   Prec.
IASR/Ideal     61%     64%
UASR/HMM       57%     54%
Direct/ANN     36%     34%
As a comparison with [27], where intelligibility is tested similarly: their manually tuned optimal rule-based facial parameters are close to our IASR, since there was no recognition error, without voice the time alignment quality is not important, and our TTVS is rule based. Their HMM test is similar to our UASR, because both are without vocabulary, both target a time-aligned phoneme string to be converted to facial parameters, and our ASR is HMM based. Their ANN system is very close to our direct conversion except for the training set: theirs uses audio from a standard speech database and rule-based calculated trajectory video data, while our system is trained on an actual recording of a professional lip-speaker. The results concerning intelligibility are nevertheless close to each other, see Table 3.4. This is a validation of the results, since the corresponding measurements are close to each other. It is important to note that [27] tests only intelligibility, and only three methods corresponding to ours, so our measurement is broader.
3.1.5 Conclusion
I presented a comparative study of audio-to-visual speech conversion methods. We have presented a comparison of our direct conversion system with conceptually different conversion solutions. A subset of the results correlates with already published results, validating the approach of the comparison.
We observe a higher importance of synchrony over phoneme precision in an ASR-based ATVS system. There are publications on the high impact of correct timing in different aspects [34, 37, 38], but our results show explicitly that more accurate timing achieves much better subjective evaluation than a more accurate phoneme sequence. Also, we have shown that in the aspect of subjective naturalness evaluation, direct conversion (trained on professional lip-speaker articulation) is a method which produces the highest opinion score, 95.9% of that of an original facial motion recording, with lower computational complexity than ASR-based solutions.
For tasks where intelligibility is important (support for the hearing impaired, visual information in noisy environments), modular ATVS is the best approach among those presented. Our mission of aiding hearing-impaired people calls upon us to consider using ASR-based components. For naturalness (animation, entertainment applications) direct conversion is a good choice. For both aspects UASR gives relatively good but not outstanding results.
3.1.6 Technical details
Marker tracking was done for MPEG-4 FPs 8.8, 8.4, 8.6, 8.1, 8.5, 8.3, 8.7, 8.2, 5.2, 9.2, 9.3, 9.1, 5.1, 2.10, 2.1. During synthesis, all FAPs (MPEG-4 Facial Animation Parameters) connected to these FPs were used except depth information:
• open_jaw
• lower_t_midlip
• raise_b_midlip
• stretch_l_cornerlip
• stretch_r_cornerlip
• lower_t_lip_lm
• lower_t_lip_rm
• raise_b_lip_lm
• raise_b_lip_rm
• raise_l_cornerlip
• raise_r_cornerlip
• lower_t_midlip_o
• raise_b_midlip_o
• stretch_l_cornerlip_o
• stretch_r_cornerlip_o
• lower_t_lip_lm_o
• lower_t_lip_rm_o
• raise_b_lip_lm_o
• raise_b_lip_rm_o
• raise_l_cornerlip_o
• raise_r_cornerlip_o

The inner lip contour is estimated from the outer markers.
Yellow paint was used to mark the FP locations on the face of the recorded lip-speaker. The video recording is 576i PAL (576x720 pixels, 25 frames/sec, 24 bits/pixel). The audio recording is mono, 48 kHz, 16 bit, made in a silent room. Further conversions depended on the actual method.
Marker tracking was based on color matching and intensity localization frame to frame, and the location was identified by the region. In overlapping regions the closest location on the previous frame was used to identify the marker. A frame with a neutral face was selected as the reference for the FAPU measurement. The marker on the nose is used as a reference to eliminate head motion.
The direct conversion uses a modification of Davide Anguita's Matrix Back Propagation which also enables real-time operation. The neural network used an 11-frame-long window on the input side (5 frames into the past and 5 frames into the future), and 4 principal component weights of FAP on the output. Each frame on the input is represented by a 16-band MFCC feature vector. The training set of the system contains standalone words and phonetically balanced sentences.
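A minimal sketch of this input/output arrangement follows, with an untrained stand-in network (random weights) and a clamping edge policy that is my assumption, since the edge handling at sequence boundaries is not spelled out here:

```python
import numpy as np

rng = np.random.default_rng(0)
N_MFCC, N_OUT, WIN = 16, 4, 5    # 16-band MFCC, 4 FacePCA weights, +-5 frames

def input_window(mfcc, t, win=WIN):
    """Stack frames t-win .. t+win (11 frames) into one input vector,
    clamping indices at the sequence edges (an assumed edge policy)."""
    idx = np.clip(np.arange(t - win, t + win + 1), 0, len(mfcc) - 1)
    return mfcc[idx].ravel()     # shape (11 * 16,) = (176,)

# Untrained stand-in for the trained backpropagation network: a single
# hidden layer mapping the 176-dim window to 4 FacePCA weights.
W1 = rng.normal(scale=0.1, size=(176, 32))
W2 = rng.normal(scale=0.1, size=(32, N_OUT))

def face_pca_weights(mfcc, t):
    h = np.tanh(input_window(mfcc, t) @ W1)
    return h @ W2                # 4 FacePCA weights for frame t

mfcc = rng.normal(size=(100, N_MFCC))   # 100 frames of 16-band MFCC
out = face_pca_weights(mfcc, 50)
```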
In the ASR the speech signal was converted to a sampling frequency of 16 kHz. MFCC (Mel Frequency Cepstral Coefficient) based feature vectors were computed with delta and delta-delta components (39 dimensions in total). The recognition was performed on a batch of separated samples. Output annotations and the samples were joined, and the synchrony between labels and the signal was checked manually.
The visemes for the linear interpolation method were selected manually for each viseme in Hungarian from the training set of the direct conversion. Visemes and phonemes were assigned by a table. Each segment is a linear interpolation from the actual viseme to the next one. Linear interpolation was calculated in the FAP representation.
The TTVS is a system implemented in Visual Basic with a spreadsheet of the timed phonetic data. This spreadsheet was replaced by the ASR output. Neighborhood-dependent dominance properties were calculated and viseme ratios were extracted. Linear interpolation, restrictions concerning biological boundaries, and median filtering were applied in this order. The output is a Poser data file which is applied to a model. The texture of the model is modified to black skin with differently colored MPEG-4 FP location markers. The animation was rendered in draft mode, with the field of view and resolution of the original recording. Marker tracking was performed as described above, with the exception of the differently colored markers. FAPU values were measured in the rendered pixel space, and FAP values were calculated from FAPU and tracked marker positions.
This was done for both ASR runs, uninformed and informed.
The test material was manually segmented into 2-4 word units. The lengths of the units were around 3 seconds. The segmentation boundaries were listed and the video cutting was done automatically with an Avisynth script. We used an MPEG-4 compatible head model renderer plugin for Avisynth, with the model "Alice" of the XFace project. The viewpoint and the field of view were adjusted to have only the mouth on the screen in frontal view.
During the test the subjects watched the videos full screen and used headphones.
3.2 Thesis
I. I showed that the direct AV mapping method, which is computationally more efficient than modular approaches, outperforms the modular AV mapping in the aspect of naturalness with a specific training set of a professional lip-speaker. [39]
3.2.1 Novelty
This is the first direct AV mapping system trained with data of a professional lip-speaker. The comparison to modular methods is interesting because direct AV mappings trained on low-quality articulation can easily be outperformed by modular systems in the aspects of naturalness and intelligibility.
3.2.2 Measurements
Naturalness was measured as subjective similarity to human articulation. The measurement was blind and randomized, the number of test subjects was 58, and our direct AV mapping was not significantly worse than the original visual speech, while the difference between the modular systems and the original was significant.
Opinion score averages and deviations showed no significant difference between human articulation and direct conversion, but a significant difference between human articulation and modular mapping based systems.
The measurement was done on a Hungarian database of fluently read speech. The database contains mixed isolated words and sentences.
3.2.3 Limits of validity
Tests were done on a normal speech database, with the perception of the test subjects fully focused on videos of good audio and video quality.
3.2.4 Consequences
Using direct conversion for areas where naturalness is most important is encouraged. Using a professional lip-speaker to record the audiovisual database increases the quality to be comparable with the level of human articulation. Other laboratories trained their systems with non-professionals, and those systems were not published due to their poor performance.
Figure 3.5: An example of the importance of correct timing. Frames of the word "Oktober" show timing differences between methods. Note that direct conversion received the best score even though it does not close the lips on the bilabial but closes on the velar, and it has problems with lip rounding.
Chapter 4

Temporal asymmetry
In this chapter I discuss the measurement of the relevant time window for direct AV mapping, which is important for building an audio-to-visual speech conversion system, since the temporal window of interest can be determined.
4.1 Method
The fine temporal structure of the relations of acoustic and visual features has been investigated to improve our speech-to-face conversion system. The mutual information of acoustic and visual features has been calculated with different time shifts. The results have shown that the movement of feature points on the face of professional lip-speakers can precede the changes of the acoustic parameters of the speech signal by as much as 100 ms. Considering this time variation, the quality of speech-to-face-animation conversion can be improved by using the future speech sound in the conversion.
4.1.1 Introduction
Other research projects on the conversion of speech audio signal to facial animation have concentrated on the development of feature extraction methods, database construction and system training [40, 41]. Evaluation and comparison of different systems have also had high importance in the literature. In this chapter I discuss the temporal integration of acoustic features optimal for real-time conversion to facial animation. The critical part of such systems is the building of an optimal statistical model for the calculation of the video features from the audio features. There is currently no known exact relation between the audio feature set and the video feature set; this is still an open question.
The speech signal conveys information elements in a very specific way. Some speech sounds are related rather to a steady state of the articulatory organs, others rather to the transition movements [42]. Our target application is providing a communication aid to deaf people. Professional lip-speakers have a speech rate of 5-6 phonemes/s to adapt the communication to the demands of deaf people, so the steady-state phases and the transition phases of speech sounds are longer than in everyday speech style.
The signal features that characterize a steady-state phase of a sound, a transition phase, or even a co-articulation phenomenon where the neighboring sounds are highly interrelated, need a careful selection of the temporal scope used to characterize the speech and video signals. In our model we selected 5 analysis windows to describe the actual frame of speech: the current window plus two preceding and two succeeding windows, covering a +/- 80 ms interval. Such a 5-element sequence of speech parameters can characterize transient sounds and co-articulations.
We have recognized that at the beginning of words the lip movements start earlier than the sound production. Sometimes the lips start to move to the initial position of the sounds 100 ms earlier. It was the task of the statistical model to handle this phenomenon.
In the refinement phase of our system we have tried to optimize the model by selecting the optimal temporal scope and fitting of audio and video features. The measure of the fitting has been based on the mutual information of audio and video features [43].
The base system uses an adjustable temporal window of the audio speech signal. The neural network can be trained to respond to an array of MFCC windows, using future and/or past audio data. The conversion can only be as good as the amount of mutual information between the audio and video representations allows.
Using the trained neural net for calculation of control parameters of the facial animation model
The audio processing unit extracts the audio MFCC feature vectors from the input speech signal. Five frames of MFCC vectors are used as input to the trained neural net. The NN provides FacePCA weight vectors. These are converted into the control parameters of an MPEG-4 standard face animation model. The test of the fitting of audio and video features was based on step-by-step temporal shifting of the feature vectors. The indicator of the matching was mutual information. A low mutual information value means that we have a low average chance to estimate the facial parameters from the audio feature set. The time shift value producing the highest mutual information means the maximal average chance to calculate one kind of known features from the other.
Estimation of mutual information needs a computation-intensive algorithm. The calculation is unrealistic on a large database with multidimensional feature vectors, so single MFCPCA and FacePCA parameters were interrelated. Since the single parameters are orthogonal but not independent, they are not additive; for example the FacePCA1 values are not independent from FacePCA2. The mutual information curves even in such complex cases can indicate the interrelations of parameters.
An alternative method is to calculate cross-correlation. We have also tested this method. It needs less computational power, but some relations are not indicated, so it is a lower estimate of the theoretical maximum.
Mutual information
MI(X,Y) = \sum_{x \in X} \sum_{y \in Y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)}    (4.1)
Mutual information is high if knowing X helps to find out what Y is, and it is low if X and Y are independent. To use this measure for the temporal scope, the audio signal will be shifted in time compared to the video. If the time-shifted signal still has high mutual information, this time value should be in the temporal scope. If the time shift is too large, the mutual information between the video and the time-shifted audio will be low due to the relative independence of different phonemes.
Using a and v as audio and video frames:
\forall \Delta t \in [-1\,\mathrm{s}, +1\,\mathrm{s}]: \quad MI(\Delta t) = \sum_{t=1}^{n} P(a_{t+\Delta t}, v_t) \log \frac{P(a_{t+\Delta t}, v_t)}{P(a_{t+\Delta t}) P(v_t)}    (4.2)

where P(x,y) is estimated by a 2-dimensional histogram convolved with a Gaussian window. The Gaussian window is needed to simulate the continuous space in the histogram in cases where there are only a few observations. Since the audio and video data are multidimensional and MI works with one-dimensional data, all the coefficient vectors were processed and the results are summarized. The mutual information values have been estimated from 200x200-size joint distribution histograms. The histograms have been smoothed by a Gaussian window with a 10-cell radius and 2.5-cell deviation. The marginal density distribution functions have been calculated from the sums of the joint distribution functions.
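This estimator can be sketched at a reduced size as follows (fewer bins than the 200x200 histograms used above; the smoothing radius and deviation follow the values given, and the helper name and synthetic test signals are mine):

```python
import numpy as np

def mutual_information(x, y, bins=50, sigma=2.5, radius=10):
    """Histogram estimate of MI(X;Y) in bits: 2-D joint histogram,
    separable Gaussian smoothing, marginals computed from the
    smoothed joint distribution."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    r = np.arange(-radius, radius + 1)
    g = np.exp(-0.5 * (r / sigma) ** 2)
    g /= g.sum()
    smooth = lambda m: np.convolve(m, g, mode="same")
    joint = np.apply_along_axis(smooth, 0, joint)   # smooth columns
    joint = np.apply_along_axis(smooth, 1, joint)   # smooth rows
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)               # marginals from joint
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (px @ py)[nz])))

# Sanity check on synthetic data: a noisy copy of a signal shares much
# more information with it than an independent signal does.
rng = np.random.default_rng(1)
a = rng.normal(size=5000)
mi_dependent = mutual_information(a, a + 0.1 * rng.normal(size=5000))
mi_independent = mutual_information(a, rng.normal(size=5000))
```

Sweeping such an estimate over time-shifted audio, as in Eq. (4.2), traces out the mutual information curves discussed below.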
MFCPCA and FacePCA measurements
170 seconds of audio and video speech recordings were processed. The time shift has been varied in 1 ms steps. Mel frequency coefficients are calculated for each element. Principal component analysis (PCA) has been applied for an even more compact representation of the audio features, since PCA components can represent the original speech frames with minimal average error at a given subspace dimensionality. In the following, the speech frames are described by such MFCPCA parameters.
The MFCPCA parameters are a more readable representation of frames for human experts than a PCA of MFCC feature vectors.
The MFCPCA parameters have direct relations to the spectrum. The PCA transformation does not consider the sign of the transformed vectors, so the first MFCPCA component shows an energy-like representation, as can be seen in Fig 4.1. As another example, the second MFCPCA component has positive values in voiced speech frames and negative values in frames of fricative speech elements.
The original video records have a 40 ms frame period, so to allow 1 ms step-size shifting, the intermediate shifted frame parameters have been calculated by interpolation and low-pass filtering.
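A minimal sketch of this resampling step, using plain linear interpolation as a stand-in for the interpolation and low-pass filtering described above:

```python
import numpy as np

def upsample_track(fap, frame_ms=40, step_ms=1):
    """Resample a 25 frame/s FAP trajectory to a 1 ms grid by linear
    interpolation, so the audio and video representations can be
    shifted against each other in 1 ms steps."""
    t_src = np.arange(len(fap)) * frame_ms
    t_dst = np.arange(0, t_src[-1] + 1, step_ms)
    return np.interp(t_dst, t_src, fap)

fap = np.array([0.0, 1.0, 0.0, -1.0])   # four 40 ms video frames
fine = upsample_track(fap)               # samples on a 1 ms grid
```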
Figure 4.1: Principal components of MFC feature vectors.
Table 4.1: Importance rate (variance) of the MFCPCA components.

MFCPCA   alone   first n together
1        77%     77%
2        10%     87%
3        5%      93%
4        2%      95%
Table 4.2: Importance rate (variance) of the FacePCA components.

FacePCA   alone   first n together
1         90%     90%
2         6%      96%
3         2%      98%
4         1%      99%
The audio and video signals are described by synchronous frames with a 1 ms fine step size. The signals can be shifted relative to each other by fine steps. The audio and video representations of the speech signal can be interrelated from ∆t = -1000 ms to +1000 ms. Such an interrelation can be investigated only at the level of how well, on average, a single voice element can be estimated from a shifted video element and vice versa.
Our calculation cannot quantify the additional information of the shifted signal compared to the zero-shift value; if there is any such additional information, it is not subtracted. So the curves do not indicate a need to extend the time scope for every non-zero value. Rather, the shape of the curve and the shift value of the maximum have a specific meaning.
In the new coordinate system generated by the principal component analysis, the coordinates can be characterized by the importance rate. The importance rate expresses which portion of the variance of the original space is produced in the given direction. The importance rate values in the case of the MFCPCA transformation are shown in Table 4.1.
The importance rate values in the case of the FacePCA transformation are shown in Table 4.2.
Combining the two tables by multiplication of the two vectors, a common importance estimation can be calculated. The values express the contribution of the parameter pairs to the whole multidimensional data.
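The combination of the two tables is a simple outer product of the two variance-share vectors; the following sketch reproduces it from the values in Tables 4.1 and 4.2 (variable names are illustrative).

```python
import numpy as np

# Variance shares of the first four components, from Tables 4.1 and 4.2
mfcpca_importance  = np.array([0.77, 0.10, 0.05, 0.02])
facepca_importance = np.array([0.90, 0.06, 0.02, 0.01])

# Outer product: weight of each (FacePCA_i, MFCPCA_j) curve pair
pair_importance = np.outer(facepca_importance, mfcpca_importance)
```

The (1st FacePCA, 1st MFCPCA) pair dominates with a weight of 0.90 x 0.77 ≈ 0.69, which is why the curves of the first components are drawn darkest in the figures.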
The really important curves are the combinations of the 1st-4th principal components. Their general importance is expressed by the darkness of the curves. Potential systematic errors have been carefully checked. The real synchrony of the audio-video records has been adjusted based on plosive sounds: the noise burst of plosives and the opening position of the lips are reliably identifiable characteristics. The check has been repeated at the end of the records as well. The possible synchrony error is below one video frame (40 ms).
Figure 4.2: Mutual information of the shifted 1st FacePCA and MFCPCA components. Positive ∆t means future voice. Darkness shows importance.
4.1.2 Results and conclusions
The mutual information curves were calculated and plotted for every possible PCA parameter pair in the range of -1000 to +1000 ms time shift. Only the most important curves are presented below, to show the relation of the components having the highest eigenvalues. Earlier movement of the lips and the mouth had been observed in cases of coarticulation and at the beginning of words; this delay had been considered a specific and negligible effect, and the delay value had only been estimated. Our new experiments produced a general rule with well-defined delay values. Some of the strongest relations between audio and video features are not in the synchronous time frames: the mouth starts to form the articulation in some cases 100 ms earlier, and the audio parameters follow it with such a delay.
The curves of the mutual information values are asymmetric and shifted towards positive time shift (delay in sound). This means the acoustic speech signal is a better prediction basis for calculating the previous face and lip positions than the future positions. This fact is in harmony with the practical observation mentioned above that the articulation movement precedes speech production at the beginning of words. The excitation signal comes
Figure 4.3: Mutual information of the shifted 2nd FacePCA and MFCPCA components. Positive ∆t means future voice. Darkness shows importance.