
Audio to visual speech conversion

Gergely Feldhoffer

A thesis submitted for the degree of Doctor of Philosophy

Scientific adviser: György Takács

Faculty of Information Technology
Interdisciplinary Technical Sciences Doctoral School
Pázmány Péter Catholic University

Budapest, 2010


Contents

1 Introduction
  1.1 Definitions
    1.1.1 Components
    1.1.2 Quality issues
  1.2 Applications
    1.2.1 Synface
    1.2.2 Synthesis of nonverbal components of visual speech
    1.2.3 Expressive visual speech
    1.2.4 Speech recognition today
    1.2.5 MPEG-4
    1.2.6 Face rendering
  1.3 Open questions
  1.4 The proposed approach of the thesis
  1.5 Related disciplines
    1.5.1 Speech inversion
    1.5.2 Computer graphics
    1.5.3 Phonetics

2 Motivation and the base system
  2.1 SINOSZ project
    2.1.1 A practical view
  2.2 Lucia
  2.3 The base system
    2.3.1 Database building from video data
    2.3.2 Audio
    2.3.3 Video
    2.3.4 Training
    2.3.5 First results
    2.3.6 Discussion
  2.4 Johnnie Talker
  2.5 Extending direct conversion
    2.5.1 Direct ATVS and co-articulation
    2.5.2 Evaluation

3 Naturalness of direct conversion
  3.1 Method
    3.1.1 Introduction
    3.1.2 Audio-to-visual Conversion
    3.1.3 Evaluation
    3.1.4 Results
    3.1.5 Conclusion
    3.1.6 Technical details
  3.2 Thesis
    3.2.1 Novelty
    3.2.2 Measurements
    3.2.3 Limits of validity
    3.2.4 Consequences

4 Temporal asymmetry
  4.1 Method
    4.1.1 Introduction
    4.1.2 Results and conclusions
    4.1.3 Multichannel Mutual Information estimation
    4.1.4 Duration of asymmetry
  4.2 Thesis
    4.2.1 Novelty
    4.2.2 Measurements
    4.2.3 Limits of validity
    4.2.4 Consequences

5 Speaker independence in direct conversion
  5.1 Method
    5.1.1 Introduction
    5.1.2 Speaker independence
    5.1.3 Conclusion
  5.2 Thesis
    5.2.1 Novelty
    5.2.2 Measurements
    5.2.3 Limits of validity
    5.2.4 Consequences

6 Visual speech in audio transmitting telepresence applications
  6.1 Method
    6.1.1 Introduction
    6.1.2 Overview
    6.1.3 Face model
    6.1.4 Viseme based decomposition
    6.1.5 Voice representation
    6.1.6 Speex coding
    6.1.7 Neural network training
    6.1.8 Implementation issues
    6.1.9 Results
    6.1.10 Conclusion
  6.2 Thesis
    6.2.1 Novelty
    6.2.2 Measurements
    6.2.3 Consequences


Acknowledgements

I would like to thank my supervisor György Takács for his help, and my present and former colleagues Attila Tihanyi, Tamás Bárdi, Tamás Harczos, Bálint Srancsik, and Balázs Oroszi. I am thankful to my doctoral school for providing tools and a caring environment for my work, especially personally to Judit Nyéky-Gaizler and Tamás Roska.

I am also thankful to Iván Hegedűs, Gergely Jung, János Víg, Máté Tóth, Gábor Dániel "Szasza" Szabó, Balázs Bányai, László Mészáros, Szilvia Kovács, Solt Bucsi Szabó, Attila Krebsz and Márton Selmeci, students who participated in our research group.

My work would have been less without the discussions with visual speech synthesis experts such as László Czap, Takaashi Kuratate and Sasha Fagel.

I would also like to thank my fellow PhD students and friends, especially Béla Weiss, Gergely Soós, Ádám Rák, Zoltán Fodróczi, Gaurav Gandhi, György Cserey, Róbert Wágner, Csaba Benedek, Barnabás Hegyi, Éva Bankó, Kristóf Iván, Gábor Pohl, Bálint Sass, Márton Miháltz, Ferenc Lombai, Norbert Bérci, Ákos Tar, József Veres, András Kiss, Dávid Tisza, Péter Vizi, Balázs Varga, László Füredi, Bence Bálint, László Laki, László Lővei and József Mihalicza for their valuable comments and discussions.

I thank Mrs Vida, Lívia Adorján, Mrs Haraszti, Mrs Körmendy, Gabriella Rumi, Mrs Tihanyi and Mrs Mikesy for their endless patience and helpfulness. I also thank the support of the technical staff of the university, especially Péter Tholt, Tamás Csillag and Tamás Rec.

And finally, but not least, I would like to thank my wife Bernadett and my family for their patient and loving support.


Abstract

In this thesis, I propose new results in audio speech based visual speech synthesis, which can be used as an aid for hard of hearing people or in computer aided animation. I will describe a synthesis tool which is based on direct conversion between the audio and video modalities. I will discuss the properties of this system, measuring the speech quality, and give solutions for the drawbacks that occur. I will show that using an adequate training strategy is critical for direct conversion. At the end I conclude that direct conversion can be used as well as other popular audio to visual speech conversions, and it is currently ignored undeservedly because of the lack of efficient training.


Chapter 1

Introduction

Audio to visual speech conversion is an increasingly popular applied research field today. Main conferences such as Interspeech or Eurasip have started new sections concerning multimodal speech processing; Interspeech 2008 held a special session devoted solely to audio to visual speech conversion.

Possible applications of the field are communication aiding tools for deaf and hard of hearing people [1], taking advantage of their sophisticated lip-reading capabilities, and lip-sync applications in the animation industry, in computer aided animation as well as in real-time telepresence based video games. In this thesis I will describe solutions for both of these applications.

In this chapter I will show the current status of the topic, the motivations and the state of the art techniques. To understand this chapter, basic speech and signal processing knowledge is needed.

1.1 Definitions

Speech is a multimodal process. The modalities can be classified as audio speech and visual speech. I will use the following terms:

Visual speech is a representation of the view of a talking face.

Visual speech data is the motion information of the visible speech organs in any representation.

Phoneme is the basic meaning-distinctive segmental unit of the audio speech. It is language dependent.

Viseme is the basic meaning-distinctive segmental unit of the visual speech. It is also language dependent. There are visemes belonging to phonemes, and there are phonemes which do not have a particular viseme, because the phoneme can be pronounced with more than one way of articulation.

Automatic speech recognition (ASR) is a system or a method which can extract phonetic information from the audio speech signal. Usually a phoneme string is produced.

Audio to visual speech (ATVS) conversion systems create an animation of a face according to a given audio speech.


Figure 1.1: Task of audio to visual speech (ATVS) conversion.

Direct ATVS is an ATVS which maps the audio representation to the video representation by approximation.

Discrete ATVS is an ATVS which uses classification into discrete categories in order to connect the modalities. Usually phonemes and visemes are used.

Modular ATVS is an ATVS which contains an ASR subsystem and a phoneme-viseme mapping subsystem. Modular ATVS systems are usually discrete.

AV mapping is an input-output method where the input is audio data in any representation and the output is visual data in any representation. In a discrete ATVS, the AV mapping is a phoneme-viseme mapping; in a direct ATVS it is an approximator.

1.1.1 Components

Each ATVS consists of an audio preprocessor, the AV (audio to video) mapping, and a face synthesizer. The most straightforward method is jaw opening driven by speech energy; this system is widely used in on-line games. Here the audio preprocessor is a frame-by-frame energy calculation expressed in dB, and the AV mapping is a linear function which maps the one dimensional audio data to the one dimensional video parameter, the jaw opening. The face model is usually a vertex array of the face, and the face synthesis is done by modifying the vertices of the jaw. Below, more sophisticated cases will be detailed where naturalness and intelligibility are issues.
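As an illustration of this simplest pipeline, here is a minimal sketch in Python with NumPy that maps frame-by-frame energy in dB linearly to a one dimensional jaw opening parameter. The frame length, the dB working range and the jaw limit are assumed values chosen for the example, not parameters of any system described in this thesis.

```python
import numpy as np

def jaw_opening_from_energy(signal, frame_len=640, db_floor=-50.0, db_ceil=0.0, jaw_max=1.0):
    """Simplest ATVS: frame energy in dB, mapped linearly to a jaw opening in [0, jaw_max]."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Audio preprocessing: frame-by-frame energy expressed in dB.
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    # Linear AV mapping: clip to a working range and rescale to the jaw parameter.
    norm = np.clip((energy_db - db_floor) / (db_ceil - db_floor), 0.0, 1.0)
    return jaw_max * norm

# Example: 1 s of 16 kHz noise stands in for speech; 640-sample frames give 25 values/s.
audio = np.random.uniform(-1.0, 1.0, 16000)
jaw = jaw_opening_from_energy(audio)
```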

Recent research activities focus on speech signal processing methods specifically for lip-readable face animation [2], face representation and controller methods [3], and convincingly natural facial animation systems [4].


Audio preprocessing

These systems use feature extraction methods to get useful and compact information from the speech signal. The most important quality aspects here are the dimensionality of the extracted representation and the covering error. For example, the spectrum can be approximated by a few channels of mel bands, replacing the speech spectrum with a certain error. In this case the dimensionality is reduced greatly by allowing a certain amount of noise in the represented data. Databases for neural networks have to consider dimensionality as a primary aspect.

Audio preprocessing methods can be clustered along many aspects, such as time domain or frequency domain feature extractors, approximation or classification, etc. A deeper analysis of audio preprocessing methods concerning audiovisual speech was published in [5], with the result that the main approaches perform approximately equally well. These traditional approaches are the Mel Frequency Cepstral Coefficients (MFCC) and Linear Prediction Coding (LPC) based methods. A quite convenient property of LPC based vocal tract estimation is the direct connection to the speech organs via the pipe excitation model. It seems to be a good idea to use the vocal tract for ATVS as well, but according to [5] it does not carry significantly more usable data.

AV mapping

In this step the modalities are connected: visual speech data is produced from audio data.

There are different strategies for performing this audio to visual conversion. One approach is to exploit automatic speech recognition (ASR) to extract phonetic information from the acoustic signal. This is then used in conjunction with a set of coarticulation rules to interpolate a visemic representation of the phonemes [6, 7]. Alternatively, a second approach is to extract features from the acoustic signal and convert directly from these features to visual speech [8, 9].

Face synthesis

In this step the visual speech representation is applied to a face model. It is usually separated into two independent parts: facial animation representation and model rendering. The face representation maps the quantitative data onto face descriptors. An example of this is the MPEG-4 standard. Face rendering is a method to produce a picture or an animation from face descriptions. These are usually computer graphics related techniques.

1.1.2 Quality issues

An ATVS can be evaluated on different aspects.

- naturalness: how similar the result of the ATVS is to a real person's visual speech

- intelligibility: how much the result helps the lip-reader to understand the content of the speech

- complexity: the system's overall time and space complexity

- trainability: how easy it is to enhance the system's other qualities by examples, whether this process is fast or slow, whether it is adaptable or fixed

- speaker dependency: how the system performance varies between different speakers

- context dependency: how the system performance varies between speech contents (e.g. a system which was trained on medical content may perform poorer on financial content)

- language dependency: how complex it is to port the system to a different language. The replacement of the database can be enough, or rules may have to be changed; even the possibility can be questionable.

- acoustical robustness: how the system performance varies in different acoustic environments, such as higher noise.

1.2 Applications

In this section I describe some of the recent systems and give a short description of them along the quality aspects detailed above.

1.2.1 Synface

An example of an ATVS system is the Synface [1] of KTH, Sweden. This system is designed for hearing impaired but not deaf people to handle voice calls on the phone. The system connects the phone line to a computer, where speech recognition software translates the incoming speech signal to a time aligned phoneme sequence. This phoneme sequence is the basis of the animation control. Each phoneme is assigned to a viseme, and the recognized sequence makes a string of visemes to animate. The speech recognition subsystem not only recognizes the phonemes but also performs the segmentation. The viseme sequence, timed by this segmentation information, gives the final result of the AV mapping, using a rule-based strategy. The rule set is created from examples of Swedish multimodal speech.

This system is definitely language dependent: it uses the Swedish phoneme set, a Swedish ASR, and a rule set built on Swedish examples. On the other hand the system performs very well in the aspects of intelligibility, acoustical robustness, and speaker and context dependency.


1.2.2 Synthesis of nonverbal components of visual speech

An example of audio to visual non-verbal speech estimation is the system of Gregor Hofer and Hiroshi Shimodaira [9]. Their system targets extracting the correct time of blinks in speech. The audio preprocessing in this system concentrates on non-verbal components, such as rhythm and intonation. Compared to actual videos, the original audio signal was used to test the precision of the estimation, which was above 80% with a decent time tolerance of 100 ms. It is important that there are two kinds of blink: one is the regular eye-care fast blink, and the other is the emphasized blink as a non-verbal visual speech component. Of course this work focused on the second variant.

1.2.3 Expressive visual speech

This field changed its name from "Emotional speech" to "Expressive speech" for psychological reasons. Expressive speech targets synthesizing or recognizing emotional expressions in speech. Expressing emotions is very relevant in visual speech.

I show two approaches to the field. Pietro Cosi et al. work on the virtual head "Lucia" [10] to connect an expressive audio speech synthesizer with a visual speech synthesizer. This text based system can be used as an audiovisual agent on any interactive medium where text can be used. For expressive visual speech it uses visemes for the textual content and four basic emotional states of the face as the expressive speech basis. They work on a natural blending function of these states.

Sasha Fagel works on expressive speech in a broad sense [11]. He created a method to help creating expressive audiovisual databases by leading the subject through emotional stages to reach the desired level of expression gradually. This way it is possible to record emotionally neutral content (e.g. "It was on Friday") articulated with joy or anger. The trick is to record the sentence multiple times and insert emotionally relevant content between the occurrences. One example could be the sequence "Trouble happens always with me! It was on Friday. What do you think you are?! It was on Friday. I hate you! It was on Friday." This method gives the speaker a guide to express anger which gradually increases in expressiveness. The database will contain only the occurrences of the emotionally neutral content.

1.2.4 Speech recognition today

As of 2010, after a decade, the hegemony of Hidden Markov Model (HMM) based ASR systems [12] is still standing. This approach uses a language model formulated with consecutiveness functions, and a pronunciation model with confusion functions.

The main reason for the popularity of HMM based ASR systems is the efficiency of handling many thousands of words in a formal grammar. This grammar can be used to focus the vocabulary around a specific topic to increase the correctness and reduce the time complexity. An HMM can be trained to a specific speaker but can also be trained on large databases to work speaker independently.


Figure 1.2: Mouth focused subset of feature points of MPEG-4.

1.2.5 MPEG-4

MPEG-4 is a standard for face description for communication. It uses Feature Points (FP) and Facial Animation Parameters (FAP) to describe the state of the face. The constant properties of a face can also be expressed in MPEG-4, for example the sizes in FAPU.

The usage of MPEG-4 is typical in multimedia applications where an interactive or highly compressed pre-recorded behavior of a face is needed, such as video games or news agents. One of the most popular MPEG-4 systems is Facegen [13].

For visual speech synthesis MPEG-4 is a fair choice since there are plenty of implementations and resources. The degree of freedom around the mouth is close to the actual needs, but there are features which cannot be modeled with MPEG-4, such as inflation. Gerard Bailly et al. showed that using more feature points around the mouth can increase naturalness significantly [14].

1.2.6 Face rendering

The task of synthesizing the picture from face descriptors is face rendering. Usually 3D engines are used, but 2D systems can also be found. The spectrum of approaches and image qualities is very wide, from efficient simple implementations to muscle based simulations [4].

Most of the face renderers use 3D acceleration and vertex arrays to interpolate, which is a well accelerated operation on today's video cards. In this case a few given vertex arrays represent given phases of the face, and using interpolation techniques, the state of the face can be expressed as a weighted sum of the proper vertex arrays. The resulting state can be textured and lit just as one of the originally designed facial phases.
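A minimal sketch of this weighted-sum interpolation, assuming each designed facial phase is stored as an (n_vertices x 3) vertex array; the two-phase example data are invented for illustration.

```python
import numpy as np

def blend_face(phases, weights):
    """Face synthesis by interpolation: the rendered state is a weighted sum of
    designed vertex arrays (one array per facial phase)."""
    phases = np.asarray(phases, dtype=float)      # (n_phases, n_vertices, 3)
    weights = np.asarray(weights, dtype=float)    # (n_phases,)
    weights = weights / weights.sum()             # keep the result in the modeled range
    return np.tensordot(weights, phases, axes=1)  # (n_vertices, 3)

# Toy example: blend 70% neutral face with 30% open-jaw face (4 vertices only).
neutral  = np.zeros((4, 3))
open_jaw = np.array([[0.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, -2.0, 0.0], [0.0, -1.0, 0.0]])
frame = blend_face([neutral, open_jaw], [0.7, 0.3])
```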


1.3 Open questions

It is clear that face modeling and facial animation, subtasks of audio to visual speech conversion, are still evolving but are mainly development fields. There are, however, examples of open research areas, such as the approximation of the human skin's physical properties, the connection of the modalities, the evaluation of AV mapping, what is speaker dependent in the articulation, and what is the minimal necessary number of degrees of freedom for perfect facial modeling.

My motivations cover the applied research on the connection between the modalities. This is also an open question. There are convenient arguments for the physical relation between the modalities: the speech organs are used both for audio and visual speech, although some of them are not visible. There must be physical effects of the visible speech organs on the audio speech.

On the other hand, there are phenomena where the connection between the modalities is minimal. Speech disorders can affect the audio speech without a visible trace. Ventriloquism (the art of speaking without lip movement, usually performed with puppets creating the illusion of a speaking puppet) is also an interesting exception.

To avoid inconsistency I turned to the clarified topic of audio to visual speech conversion.

1.4 The proposed approach of the thesis

The physical connection between the modalities can give a guideline to reach basic conversion from audio to video, but this goal is not clear without specified quality aspects. The next chapter will detail how our research group met the field through the aid of deaf and hard of hearing people. Their main quality aspects of the resulting visual speech are lip-readability and naturalness. This way the problem can be redefined as searching for the most appropriate visual speech for the given audio speech signal, not restoring the original visual articulation.

The physical connection can be utilized easily through direct conversion. Direct ATVS systems are not speech recognition systems; the target is to produce an animation without recognizing any of the language layers such as phonemes or words, as this part of the process is left to the lip-reader. Because of this, our ATVS uses no phoneme recognition; furthermore there is no classification part in the process. This is the direct ATVS, avoiding any discrete type of data in the process. Discrete ATVS systems use visemes as the visual match of phonemes to describe a given state of the animation of a phoneme, and use interpolation between them to produce coarticulation.

One of the most important benefits of direct conversion is the chance to conserve the nonverbal content of the speech such as prosody, dynamics and rhythm. Modular ATVS systems have to synthesize these features to maintain the naturalness of the result.


1.5 Related disciplines

1.5.1 Speech inversion

Our task is similar to speech inversion, which tends to extract information from the speech signal about the state sequence of the speech organs. However, speech inversion aims to reproduce every speech organ in exactly the same state as the speaker used his organs, with every speaker dependent property [15, 16]. ATVS is different: the target is to produce a lip-readable animation which depends only on the visible speech organs and does not depend on the speaker dependent features of the speech signal.

Speech inversion aims to recover the state sequence of the speech organs from speech. A very simple model and solution of this problem is the vocal tract. Recent research in this field concerns tongue motion and models in particular.

1.5.2 Computer graphics

Synthesis of the human face is a challenging field of computer graphics. The main reason for the high difficulty is the very sensitive human observer. Humankind has developed a highly sophisticated communication system with facial expressions; it is a basic human skill to identify emotional and contextual content from a face. An example of cutting edge face synthesis systems is the rendering system of the movie Avatar (2009), where the system parameters were extracted from actors [17]. There are recent scientific results on efficient volume conserving deformations of facial skin based on muscular modeling [4]. These modern rendering methods can reproduce creasing of the face, which is perceptually important.

1.5.3 Phonetics

The science of phonetics is related to ATVS systems through the ASR based approaches. Phonetically interesting areas are the ASR component, the phoneme string processing, and the rules applied on phoneme strings to synthesize visual speech, such as interpolation rules or dominance rules.

The details of articulation, and the relation of the phonetic content to the facial muscle controls, is the topic of articulatory phonetics [18, 19]. This field classifies phonemes by their places of articulation: labial-dental, coronal, dorsal, glottal. ATVS systems are aware of visible speech organs, so labial-dental consonants are important, along with vowels and articulations with an open mouth. For example the phoneme "l" is identifiable by its alveolar articulation since it is done with an opened mouth.

Articulatory phonetics has important results for ATVS systems, as we will see in the details of visual speech synthesis from phoneme strings.


Chapter 2

Motivation and the base system

In this chapter I will describe the main tasks I had to deal with, showing the motivation of my thesis. I will describe a base system as well. The base system itself is not part of the contribution of my thesis, although understanding the base system is important to understand my motivations.

2.1 SINOSZ project

The original project with SINOSZ (National Association of Deaf and Hearing Impaired) aimed at a mobile system to help hard of hearing people deal with audio-only information sources. The first idea was to visualize the audio data in some learnable representation, but the association rejected any visualization technique which must be learned by the deaf community, so the visualization method had to be some already known representation of the speech. We had basically two choices: to implement an ASR to get text, or to translate to facial motion. We expected more efficient and robust quality from facial motion conversion with the capabilities of a mobile device in 2004.

The development of the mobile application was initiated with the project. The mobile branch of the project is out of the scope of my thesis, although the requirement of efficiency is important.

2.1.1 A practical view

When I started to work on audio to visual speech conversion, after examining some of the systems in the aspects of requirements and qualities detailed in the previous chapter, I decided to use direct conversion. The main reason at this time was to get a functional and efficient test system as soon as possible, to have results and first hand experience, with the hope of a sufficiently efficient implementation later.

Direct conversion can be deployed on mobile platforms more easily than database dependent classifier systems. Not only is the computational time moderate, but the memory requirements are also lower. Choosing direct conversion was the option with the guaranteed possibility of a test implementation on the target platform.


2.2 Lucia

At the beginning of the project I convinced the team to use direct mapping between modalities. My two important reasons were the efficiency and the lack of the requirement of a labeled database, unlike for an ASR. Since we did not have any audiovisual databases (and even in 2009 there are quite few publicly available) we had to think about not only the system but the database as well. Direct conversion does not need labeled data, so manual work can be minimized, which shortens the production time.

So the planned system contained simple audio preprocessing (LPC or MFCC), a direct mapping from audio feature vectors to video via a code-book or neural network, and visualization of the result on an artificial head. We did not have any face models, nor did we want to create one, so we were looking for an available head model.

The first test system used the talking head of Cosi et al. [10] called Lucia. The head model was originally used for expressive speech synthesis. The system used MPEG-4 FAP as input and generated a run-time video in an OpenGL window; exporting to video files was also available.

2.3 The base system

The base system is an implementation of direct conversion (see Figure 2.1).

2.3.1 Database building from video data

Direct conversion needs pairs of audio and video data, so the database should be a (possibly labeled) audiovisual speech recording where the visual information is enough to synthesize a head model. Therefore we recorded a face with markers on a subset of the MPEG-4 FP positions, mostly around the mouth and jaw, and also some reference points. Basically this is a preprocessed multimedia material prepared specially for use as a training set for neural networks. For this purpose the data should not contain strong redundancy, to keep the learning speed practically acceptable, so the pre-processing also includes the choice of an appropriate representation. With an inadequate representation the learning may take months, or may not even converge.

2.3.2 Audio

The voice signal is processed at a rate of 25 frames/s to be in synchrony with the processed video signal. One analysis window is 20-40 ms; the maximum number of samples in the 40 ms window is 2^n samples. The input speech can be pre-emphasis filtered with H(z) = 1 - 0.983 z^-1. A Hamming window and FFT with the Radix-2 algorithm are applied. The FFT spectrum is converted to 16 mel-scale bands, and a logarithm and DCT are applied. Such Mel Frequency Cepstrum Coefficient (MFCC) feature vectors are commonly used in general speech recognition tasks. The MFCC feature vectors provide the input to the neural networks after scaling to [-0.9, 0.9].
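The following sketch reproduces this front end for a single analysis window (Python with NumPy). The exact mel filter-bank layout and normalization are assumptions made for the example; only the overall chain (pre-emphasis, Hamming window, FFT, 16 mel bands, log, DCT) follows the description above.

```python
import numpy as np

def mfcc_frame(frame, sr=48000, n_mels=16):
    """MFCC of one analysis window: pre-emphasis H(z) = 1 - 0.983 z^-1, Hamming window,
    FFT, 16 triangular mel bands, log, DCT. Filter-bank details are assumptions."""
    nfft = len(frame)
    emphasized = np.append(frame[0], frame[1:] - 0.983 * frame[:-1])
    power = np.abs(np.fft.rfft(emphasized * np.hamming(nfft))) ** 2

    # Triangular mel filter bank between 0 Hz and sr/2.
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = np.floor((nfft + 1) * mel2hz(np.linspace(0.0, hz2mel(sr / 2), n_mels + 2)) / sr).astype(int)
    fbank = np.zeros((n_mels, nfft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        fbank[i, lo:mid] = np.linspace(0.0, 1.0, mid - lo, endpoint=False)
        fbank[i, mid:hi] = np.linspace(1.0, 0.0, hi - mid, endpoint=False)
    log_mel = np.log(fbank @ power + 1e-10)

    # DCT-II of the log mel energies; in the base system these 16 values are later
    # rescaled to [-0.9, 0.9] before neural network training.
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[:, None] + 0.5) * n[None, :])
    return dct.T @ log_mel

# Example: one 40 ms window at 48 kHz (1920 samples), zero-padded to 2048 = 2^11 samples.
window = np.zeros(2048)
window[:1920] = np.random.uniform(-1.0, 1.0, 1920)
coeffs = mfcc_frame(window)   # 16 MFCC values for this frame
```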


Figure 2.1: Workflow used in Lucia.


2.3.3 Video

For video processing we used two methods. Both methods are based on video recordings of a speaker and feature tracker applications. The first method is based on markers only, which are placed around the mouth. The markers were selected as a subset of the MPEG-4 face description standard. Tracking the markers is a computer aided process; a 98% precise marker tracker algorithm was developed for this phase. The mistakes were corrected manually. The marker positions as a function of time were the raw data, which was normalized by control points such as the nose to eliminate the motion of the whole head. This gives a 30-36 dimensional space depending on the marker count. This data is very redundant and high dimensional and is not suitable for neural network training, so PCA was applied to reduce the dimensionality and eliminate the redundancy. PCA can be treated as a lossy compression because only the first 6 parameters were used for training. Using only 6 coefficients causes about 1 pixel of error on a PAL screen, which is the precision of the marker tracking. The first 4 coefficients can be seen in Fig 2.2.

Thebasesystems’svideodatabaseisasetofvideorecordsofprofessionallip- speakers. Their movingfacesaredescribedbythe15elementsubsetofthestandard MPEG-4featurepoints(FP)set(84). Thesefeaturepointswere markedbycolored dotsonthefaceofthespeakers. Thecoordinatesoffeaturepointswerecalculatedby a markertrackingalgorithm.

The marker tracking algorithm used the number of markers (nm) as input, and on each frame it looked for the nm most marker-like areas of the picture. The marker-likeness was defined as a high energy fixed sized blob after yellow filtering. The tracking contained a self-check by looking for additional markers and by comparing the marker-likenesses of the markers [.., nm-1, nm, nm+1, ..]; good tracking shows a strong decrease after the nm-th marker. If the decrease is before nm, there are missing markers; if the decrease is after nm, there are misleading blobs in the frame. Using the self-check of the tracking, manual corrections were made.
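A sketch of this self-check logic, under the assumption that a marker-likeness score is already available for each candidate blob; the "strongest decrease" rule and the toy scores are illustrative only.

```python
import numpy as np

def check_marker_count(blob_scores, nm):
    """Marker tracking self-check: sort candidate blobs by marker-likeness and see where
    the strongest decrease occurs. With good tracking it occurs right after the nm-th
    blob; earlier means missing markers, later means misleading blobs in the frame."""
    scores = np.sort(np.asarray(blob_scores, dtype=float))[::-1]
    drops = scores[:-1] - scores[1:]
    drop_pos = int(np.argmax(drops)) + 1     # number of blobs before the strongest decrease
    if drop_pos == nm:
        return "ok"
    return "missing markers" if drop_pos < nm else "misleading blobs"

# Toy example: 15 true markers plus a few weak false candidates.
scores = np.concatenate([np.random.uniform(0.8, 1.0, 15), np.random.uniform(0.0, 0.2, 4)])
print(check_marker_count(scores, nm=15))     # expected: "ok"
```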

The FP coordinates form 30 dimensional vectors which are compressed by PCA. We have realized that the first few PCA basis vectors have close relations to the basic movement components of the lips. Such components can differentiate visemes. The marker coordinates are transformed into this basis, and we can use the transformation weights as data (FacePCA). The FacePCA vectors are the target output values of the neural net during the training phase [8].
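A compact sketch of the FacePCA step, assuming the tracked marker coordinates are already normalized and arranged as one 30 dimensional row per frame; the SVD-based implementation and the random toy data are illustrative.

```python
import numpy as np

def fit_face_pca(marker_coords, n_keep=6):
    """FacePCA: centre the per-frame marker coordinate vectors and project them onto
    the first principal components; the resulting weights are the training targets."""
    mean = marker_coords.mean(axis=0)
    centered = marker_coords - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_keep]                      # (n_keep, 30) lip-movement components
    weights = centered @ basis.T             # (n_frames, n_keep) FacePCA weights
    return mean, basis, weights

def reconstruct_markers(mean, basis, weights):
    """Lossy reconstruction of the marker coordinates from the kept weights."""
    return weights @ basis + mean

# Toy example: 1000 frames of 15 markers, (x, y) each, i.e. 30 dimensional vectors.
coords = np.random.randn(1000, 30)
mean, basis, w = fit_face_pca(coords)
approx = reconstruct_markers(mean, basis, w)
```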

2.3.4 Training

The synchrony of the audio and video data is checked with the word "papapa" at the beginning and the end of the recording. The first opening of the mouth on this bilabial can be synchronized with the burst in the audio data. This synchronization guarantees that the pairs of audio and video data were recorded at the same time. For the best result the neural network has to be trained on multiple windows of audio feature vectors, where the window count has to be chosen based on the optimal temporal scope.
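A sketch of how this "papapa" based synchrony check could look, assuming both the frame-wise audio energy and a jaw opening trajectory are available at the 25 frames/s video rate; the thresholds are invented for the example.

```python
import numpy as np

def sync_offset(audio_energy_db, jaw_opening, energy_thresh=-30.0, jaw_thresh=0.2):
    """Synchrony check on the leading 'papapa': the offset between the first audio burst
    and the first mouth opening, in video frames (25 frames/s). A result of 0 means the
    two recorded streams are aligned."""
    first_burst = int(np.argmax(np.asarray(audio_energy_db) > energy_thresh))
    first_open = int(np.argmax(np.asarray(jaw_opening) > jaw_thresh))
    return first_open - first_burst

# Toy example: the mouth starts to open 2 frames (80 ms) before the burst is audible.
energy = np.full(100, -60.0); energy[40:] = -10.0
jaw = np.zeros(100); jaw[38:] = 1.0
print(sync_offset(energy, jaw))   # -2
```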


Figure 2.2: Principal components of MPEG-4 feature points in the database of a professional lip-speaker.


The neural network is a back-propagation implementation by Davide Anguita called Matrix Back Propagation [20]. This is a very efficient software; we use a slightly modified version of the system to be able to continue a training session.

2.3.5 First results

The described modules were implemented and trained. The system was measured with a recognition test with deaf people. To simulate a measurable communication situation, the test covered numbers, names of days of the week and months. As the measurement aimed to tell the difference between the ATVS and a real person's video, the situation had to take average lip-reading cases into consideration. As we found in [8], deaf persons rely upon context more than hearing people. In the cases of numbers or names of months the context clearly defines the class of the word but leaves the actual value uncertain. During the test the test subjects had to recognize 70 words from video clips. One third of the clips were original video clips from the recording of the database, another one third were output of the ATVS from audio signals, and the remaining one third were synthesized video clips from the extracted video data. The difference between the recognition of the real recording and the face animation from the extracted video data gives the recognition error from the face model and the database, while the difference between animations from video data and audio data gives the quality of the audio to video conversion. Table 2.1 shows the results.

Table 2.1: Recognition rates of different video clips.

  Material                   Recognition rate
  original video             97%
  face model on video data   55%
  face model on audio data   48%

2.3.6 Discussion

In this case the 48% should be compared to the 55%. The face model driven by recorded visual speech data is the best possible behavior of the direct ATVS system. The ratio of the results is 87%. It means that the base system can make deaf people understand 87% of what the best possible system could achieve. This was a very encouraging experience.

The 55% is a weak result compared to the 97% of the original video. This falloff is because of the artificial head. This ratio could be enhanced by using more sophisticated head models and facial parameters, but this direction of research and development is out of the scope of this thesis.


2.4 Johnnie Talker

Johnnie Talker is a real-time system with very low time complexity, a pure AV mapping application with a simple face model and facial animation. Johnnie was implemented to demonstrate the low time complexity of the direct ATVS approach. The application is a development of our research group; it uses my implementation of the AV mapping, Tamás Bárdi's audio preprocessing and Bálint Srancsik's OpenGL based head model.

Johnnie is freely downloadable from the webpage of the author of this dissertation [21]. It is a Windows application using OpenGL.

Because of the demand for low latency, no theoretical delay was used in this system. In the next few chapters I will describe how the naturalness and the intelligibility can be enhanced by using a time window in the future of the audio modality. This can be implemented by delaying the audio dub to maintain audio-video synchrony while using the future audio at the same time. For example, a phone line can be delayed to the theoretically optimal time window. But since Johnnie can be used via a microphone, which cannot be delayed, any additional buffering would cause noticeable latency, which ruins the subjective sense of quality. As one of the next chapters will describe, subjective quality evaluation depends heavily on audio-video synchrony, and this phenomenon appears strongly in the perception of a synthesized visual speech of one's own speech in real-time.

Johnnie Talker was shown at various international conferences with success. It was a good opportunity to test language independence in practice.

We were looking for techniques to improve the qualities of the real-time system without additional run-time overhead. There will be a chapter about a method which can enhance the speaker independence of the system using only database modifications, so no run-time penalty is needed.

2.5 Extending direct conversion

2.5.1 Direct ATVS and co-articulation

The most common form of language is personal talk, which is an audiovisual speech process. Our research is focused on the relation of the audio and the visual part of talking, to build a system converting the voice signal into face animation.

Co-articulation is the phenomenon of transient phases in the speech process. In the audio modality, co-articulation is the effect of the neighboring phonemes on the actual state of speech in a short window of time, shorter than a phoneme duration. In speech synthesis, there is a strong demand to create natural transients between the clean states of speech. In visual speech synthesis this issue is also important. In the visual speech process there are visemes even if the synthesizer does not explicitly use this concept. Visual co-articulation can be defined as a system of influences between visemes in time. Because of biological limitations, visual co-articulation is slower than audio co-articulation, but similar in other ways: neighboring visemes can have an effect on each other, there are stronger visemes than others, and most of the cases can be described or approximated as an interpolation of neighboring visemes.

Let me call a system a visual speech transient model if it generates intermediate states of visual speech units, such as visemes. An example of a visual speech transient model is the strictly adopted co-articulation concept on visemes, the visual co-articulation, since the viseme string processing has to decide how interpolation should take place between the visemes. Another example of visual speech transient models is the direct conversion's adaptation to longer time windows in order to include more than one phoneme in the audio modality. In this case the transients depend on acoustical properties. In modular ATVS systems, the transients are coded in rules depending on viseme string neighborhoods.

Utilization

Training a direct ATVS needs audio-video data pairs. Since plenty of speech audio databases exist but only a few audiovisual ones, building a direct ATVS means building a multimodal database first. A discrete ATVS is a modular system; it is possible to use existing speech databases to train the voice recognition, and separately train the animation part on phoneme pairs or trigraphs [1]. Therefore direct ATVS needs a special database, but the system will handle energy and rhythm naturally, while a discrete ATVS has to reassemble the phonemes into a fluid coarticulation chain of viseme interpolations. Let us use the term "temporal scope" for the overall time of a coarticulation phenomenon, which means that the state of the mouth depends on this time interval of the speech signal. In a direct ATVS the calculation of a frame is based on this audio signal interval. In a discrete ATVS the visemes and the phonemes are synchronized and interpolation is applied between them, as is popular in text to visual speech systems [22]. Figure 2.3 shows this difference.

Figure 2.3: Temporal scope of discrete (interpolating) and direct ATVS.

Asymmetry

As the mutual information estimation showed, any given state of the video data stream can be calculated fairly from a definable relative time window of the speech signal. This model predicts that the transient phase of the visible speech can be calculated in the same way as the steady phase, as Figure 2.3 shows.

This model gives a prediction about temporal asymmetries in the multimodal speech process. This asymmetry can be explained by mental predictivity in the motion of the facial muscles to fluently form the next phoneme. Details will follow in the chapter "Temporal asymmetry".

Speaker independence

Since the direct conversion is usually an approximation trained on a given set of audio and video states, it suffers heavy dependence on the database. As I detailed before, for a good direct ATVS a good lip-speaker is needed to share visual data with the system. Talented lip-speakers are rare, and most of the experienced lip-speakers are women. This means that a single recording of one lip-speaker gives not only a speaker dependent system, but collecting more professional lip-speakers would result in a gender dependent system, since the statistics of the data would be heavily biased, and it is very difficult to collect enough male lip-speakers for the system.

Even if we had plenty of professional lip-speakers, there is a question about mixing the video data. People articulate differently. It is not guaranteed that a mixture of good articulations results in even an acceptable articulation. The safest solution is to choose one of the lip-speakers as a guaranteed high quality articulation, and try to use his or her performance with multiple voices.

I will give a solution for this problem in the chapter "Speaker independence in direct conversion".

2.5.2 Evaluation

The base system was published as a standalone system, and was measured with subjective opinion scores and intelligibility tests with deaf persons. In my chapter about the comparison of AV mappings, I will position the direct ATVS among the others used in the world.

Oddly there are quite few publications on direct ATVS. This is strange, because the system is one of the simplest designs. Let me tell a personal experience from a EUSIPCO conference in Florence. A young researcher was interested in our Johnnie demo. He was from ATR, Japan, and he praised our system. As I explained the workflow of the system, at each stage he said "We did the same". Even the number of PCA coefficients was the same. At the end, he said that their system produced significantly worse results than ours; it was not even published because it was flawed. We agreed then that the most important difference is the lip-speaker's professionality.

His work can be read in Japanese [23] in the annual report of the institute; by the way, in the same year we published our results in Hungarian [24, 25, 26].

Another example of direct conversion publicity is a comparison study of an unpublished direct system [27] used as an internal baseline.

Because of this under-publicity, it is important to position direct ATVS among the more popular modular ATVS systems, since most of the visual speech synthesis research groups also try to implement a direct ATVS, but their efforts fail because of the quality of the database. At first glance this may seem like bad news for our research, but the novelty of our system is still unharmed, since the work at ATR was identical only in the technical details, and a training system's technology itself, without the database, is not a whole system. Our base system is new because of the new training data and the finding of the need for a professional lip-speaker. Again, I would like to emphasize that the difference between our base system and the one developed at ATR is not "only" the database but the training strategy, which is one of the most important and fundamental parts of any learning system.

This new and successful learning strategy makes our base system novel, but in this thesis I focus on my own results, not those of the research group. Johnnie Talker is a contribution of the group, and the following extensions and measurements are contributions of the author of this thesis.

In the next chapter I will show how the base system, with the essential database of the professional lip-speaker, can be ranked among the widely used ATVS systems.


Chapter 3

Naturalness of direct conversion

In this chapter I discuss the measurement of the naturalness of synthetic visual speech, and the comparison of different AV mapping approaches.

3.1 Method

A comparative study of audio-to-visual speech conversion is described in this chapter. The direct feature-based conversion approach is compared to various indirect ASR-based solutions. The already detailed base system was used as the direct conversion. The ASR based solutions are the most sophisticated systems currently available in Hungarian. The methods are tested in the same environment in terms of audio pre-processing and facial motion visualization. Subjective opinion scores show that with respect to naturalness, direct conversion performs well. Conversely, with respect to intelligibility, ASR-based systems perform better.

The thesis about the results of the comparison is important because no AV mapping comparisons were done before with the novel training database of a professional lip-speaker.

3.1.1 Introduction

A difficulty that arises in comparing the different approaches is that they are usually developed and tested independently by the respective research groups. Different metrics are used, e.g. intelligibility tests and/or opinion scores, and different data and viewers are applied [28]. In this chapter I describe a comparative evaluation of different AV mapping approaches within the same workflow, see Figure 3.1. The performance of each is measured in terms of intelligibility, where lip-readability is measured, and naturalness, where a comparison with real visual speech is made.

3.1.2 Audio-to-visual Conversion

The performance of five different approaches will be evaluated. These are summarized as follows:


Figure 3.1: Multiple conversion methods were tested in the same environment.


Figure 3.2: Structure of direct conversion.

- A reference based on natural facial motion.
- A direct conversion system.
- An ASR based system that linearly interpolates phonemic/visemic targets.
- An informed ASR-based approach that has access to the vocabulary of the test material (IASR).
- An uninformed ASR (UASR) that does not have access to the text vocabulary.

These are described in more detail in the following sections.

Direct conversion

We used our base system, with a database of a professional lip-speaker. The length of the recorded speech was 4250 frames.


ASR-based conversion

For the ASR based approaches a Weighted Finite State Transducer and Hidden Markov Model (WFST-HMM) decoder is used. Specifically, a system known as VOXerver [29] is used, which can run in one of two modes: informed, which exploits knowledge of the vocabulary of the test data, and uninformed, which does not. Incoming speech is converted to MFCCs, after which blind channel equalization is used to reduce linear distortion in the cepstral domain [30]. Speaker independent cross-word decision-tree based triphone acoustic models are applied, which were previously trained using the MRBA Hungarian speech database [31], a standardized, phonetically balanced Hungarian speech database developed at the Budapest University of Technology and Economics.

The uninformed ASR system uses a phoneme-bigram phonotactic model to constrain the decoding process. The phoneme-bigram probabilities were estimated from the MRBA database. In the informed ASR system a zerogram word language model is used with a vocabulary size of 120 words. Word pronunciations were determined automatically as described in [32].

In both types of speech recognition approaches the WFST-HMM recognition network was constructed offline using the AT&T FSM toolkit [33]. In the case of the informed system, phoneme labels were projected to the output of the transducer instead of word labels. The precision of the segmentation is 10 ms.

Viseme interpolation

To compare the direct and indirect audio-to-visual conversion systems, a standard approach for generating visual parameters is to first convert a phoneme to its equivalent viseme via a lookup table, then linearly interpolate the viseme targets. This approach to synthesizing facial motion is oversimplified because coarticulation effects are ignored, but it does provide a baseline on expected performance (worst-case scenario).
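A sketch of this baseline, with a made-up three-entry phoneme-to-viseme table and two dimensional FAP targets; real tables are language dependent and much larger.

```python
import numpy as np

# Hypothetical data for illustration: a tiny phoneme-to-viseme table and one FAP target
# vector (here: jaw opening, lip width) per viseme. Real tables are language dependent.
PHONEME_TO_VISEME = {"p": "bilabial", "a": "open", "t": "alveolar"}
VISEME_TARGETS = {"bilabial": np.array([0.0, 0.2]),
                  "open":     np.array([1.0, 0.6]),
                  "alveolar": np.array([0.4, 0.4])}

def interpolate_visemes(segments, fps=25):
    """Baseline conversion: each time aligned phoneme segment (phoneme, start_s, end_s)
    is replaced by its viseme target, and targets are linearly interpolated over time."""
    centres = np.array([0.5 * (start + end) for _, start, end in segments])
    targets = np.stack([VISEME_TARGETS[PHONEME_TO_VISEME[p]] for p, _, _ in segments])
    frame_times = np.arange(0.0, segments[-1][2], 1.0 / fps)
    # Interpolate each FAP dimension independently between the viseme targets.
    return np.stack([np.interp(frame_times, centres, targets[:, d])
                     for d in range(targets.shape[1])], axis=1)

# Example: the syllable "pat" as a time aligned phoneme string coming from the ASR.
faps = interpolate_visemes([("p", 0.00, 0.08), ("a", 0.08, 0.30), ("t", 0.30, 0.40)])
```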

Modular ATVS

To account for coarticulation effects, a more sophisticated interpolation scheme is required. In particular, the relative dominance of neighboring speech segments on the articulators is required. Speech segments can be classified as dominant, uncertain or mixed according to the level of influence exerted on the local neighborhood. To learn the dominance functions an ellipsoid is fitted to the lips of speakers in a video sequence articulating Hungarian triphones. To aid the fitting, the speakers wear a distinctly colored lipstick. Dominance functions are estimated from the variance of visual data in a given phonetic neighborhood set. The learned dominance functions are used to interpolate between the visual targets derived from the ASR output [34]. We use the implementation of László Czap and János Mátyás here, which produces Poser script. FAPs are extracted from this format by the same workflow as from an original recording.


Figure 3.3: Modular ATVS consists of an ASR subsystem and a text to visual speech subsystem.

Rendering Module

The visualization of the output of the ATVS methods is common to all approaches. The output from the ATVS modules are facial animation parameters (FAPs), which are applied to a common head model for all approaches. Note that although better facial descriptors than MPEG-4 are available, MPEG-4 is used here because our motion capture system does not provide more detail than this. The rendered video sequences are created from these FAP sequences using the Avisynth [35] 3D face renderer. As the main components of the framework are common between the different approaches, any differences are due to the differences in the AV mapping methods. Actual frames are shown in Fig 3.5.

3.1.3 Evaluation

Implementation specific noncritical behavior (e.g. articulation amplitude) should be normalized to ensure that the comparison is between the essential qualities of the methods. To discover these differences, a preliminary test was done.

Preliminary test

To tune the parameters of the systems, 7 videos were generated by each of the five mapping methods, and some sequences were re-synthesized from the original facial motion data. All sequences started and ended with a closed mouth, and each contained between 2-4 words. The speaker who participated in all of the tests was not one of those who were involved in training of the audio-to-visual mapping. The videos were presented in a randomized order to 34 viewers, who were asked to rate the quality of the systems using an opinion score (1-5). The results are shown in Table 3.1.

Table 3.1: Results of preliminary tests used to tune the system parameters. Shown are the average and standard deviation of scores.

  Method    Average score   STD
  UASR      3.82            0.33
  Original  3.79            0.24
  Linear    3.17            0.4
  Direct    3.02            0.41
  IASR      2.85            0.72

The results were unexpected; the IASR, which uses a more sophisticated coarticulation model, was expected to be one of the best performing systems. Closer investigation of the lower scores showed that the reason was rather the poorer audiovisual synchrony of IASR compared to UASR. The reason for this phenomenon is the difference in the mechanism of the informed and the uninformed speech recognition process. During informed recognition the timing information is produced as a consequence of the alignment of the correct phonemes to the signal, which forces the segment boundaries by using the certain phonetic information. The uninformed recognition may miscategorize the phoneme, but the acoustical changes drive the segment boundaries, so the resulting segmentation is closer to the acoustically reasonable one than the phonetically driven segmentation.

A qualitative difference between the direct and indirect approaches is the degree of mouth opening: the direct approach tended to open the mouth on average 30% more than the indirect approaches. Consequently, to bring the systems into the same dynamic range, the mouth opening for the direct mapping was damped by 30%. The synchrony of the ASR-based approaches was checked for systematic errors (constant or linearly increasing delays) using cross correlation of locally time shifted windows, but no systematic patterns of errors were detected.
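A sketch of such a check, computing the best local cross correlation lag between two FAP trajectories over successive windows; the window length and lag range are assumed values.

```python
import numpy as np

def local_lags(reference, test, win=50, max_lag=10):
    """For successive windows of two trajectories, find the time shift (in frames) with
    the highest cross correlation. A constant or linearly growing lag across the windows
    would indicate a systematic (constant or increasing) delay between the systems."""
    lags = []
    for start in range(max_lag, len(reference) - win - max_lag, win):
        ref = reference[start:start + win] - np.mean(reference[start:start + win])
        best_score, best_lag = -np.inf, 0
        for lag in range(-max_lag, max_lag + 1):
            seg = test[start + lag:start + lag + win]
            seg = seg - np.mean(seg)
            score = float(np.dot(ref, seg))
            if score > best_score:
                best_score, best_lag = score, lag
        lags.append(best_lag)
    return np.array(lags)

# Example: a trajectory and a copy delayed by 3 frames should give a constant lag of 3.
x = np.sin(np.linspace(0.0, 20.0, 500))
print(local_lags(x, np.roll(x, 3)))
```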

3.1.4 Results

ASR subsystem

The quality of the ASR-based approach is affected by the recognized phoneme string. This is typically 100% for the informed system, as the test set consists only of a small number of words (months of the year, days of the week, and numbers under 100), whilst the uninformed system has a typical error rate of 25.21%. Despite this, the ATVS using this input performs surprisingly well. The likely reason might be the pattern of confusions: often phonemes that are confused acoustically appear visually similar on the lips. A second factor that affects the performance of the ASR-based approaches is the precision of the segmentation. Generally the uninformed systems are more precise on average than the informed systems. The precision of the segmentation can severely impact the subjective opinion scores. We therefore first attempt to quantify these likely sources of error.

Figure 3.4: Trajectory plot of different methods for the word "Hatvanhárom" (hOtvOnha:rom); jaw opening and lip opening width are shown over time (ms) in FAP units for the Direct, IASR, UASR and Linear methods. Note that the speaker did not pronounce the utterance perfectly, and the informed system attempts to force a match with the correctly recognized word. This leads to time alignment problems.

The informed recognition system is similar in nature to forced alignment in standard ASR tasks. For each utterance the recognizer is run in forced alignment mode for all of the vocabulary entries. The main difference between the informed and the uninformed recognition process is the different Markov state graphs for recognition. The informed system is a zerogram without loopback, while the uninformed graph is a bigram model graph where the probabilities of the connections depend on language statistics.

While matching the extracted features with the Markovian states, the differences are accumulated in both scenarios. However, the uninformed system allows for different phonemes outside of the vocabulary to minimize the accumulated error. For the informed system only the most likely sequence is allowed, which can distort the segmentation; see Figure 3.4 for an example where the speaker mispronounces the word "Hatvanhárom" (hOtvOnha:rom, "63" in Hungarian). The (mis)segmentation of OtvO means the IASR ATVS system opens the mouth after the onset of the vowel. Human perception is sensitive to this error, and so this severely impacts the perceived quality. Without forcing the vocabulary, a system may ignore one of the consonants but open the mouth at the correct time.

Note that the generalization of this phenomenon is out of the scope of this work. We have demonstrated that this is a problem with certain implementations of HMM-based ASR. Alternative, more robust implementations might alleviate these problems.


Table 3.2: Results of opinion scores, average and standard deviation.

  Method                   Average score   STD
  Original facial motion   3.73            1.01
  Direct conversion        3.58            0.97
  UASR                     3.43            1.08
  Linear interpolation     2.73            1.12
  IASR                     2.67            1.29

Subjective opinion scores

The test setup is similar to the preliminary test described previously to tune the system. However, 58 viewers were used, and only a quantitative opinion survey was made on a scale of 1 (bad, very artificial) to 5 (real speech).

The result of the opinion score test is in Table 3.2. The advantage of direct conversion against UASR is on the edge of significance with p = 0.0512, as is the difference between the original speech and the direct conversion with p = 0.06, but UASR is significantly worse than original speech with p = 0.00029. The results compared to the preliminary test also show that with respect to naturalness, the excessive articulation is not significant. The advantage of correct timing over a correct phoneme string is also significant.

Note that the linear interpolation system exploits better quality ASR results, but still performs significantly worse than the average of other ASR based approaches. This shows the importance of correctly handling viseme dominance and viseme neighborhood sensitivity in ASR based ATVS systems.

Intelligibility

Intelligibility was measured with a test of recognition of video sequences without sound. This is not the popular Modified Rhyme Test [36], but for our purposes with hearing impaired viewers it is more relevant, since keyword spotting is the most common lip-reading task. The 58 test subjects had to guess which word was said from a given set of 5 other words of the same category. The categories were numbers, names of months and the days of the week. All the words were said twice. The sets were intervals to eliminate the memory test from the task (for example "2", "3", "4", "5", "6" can be a set). This task models the situation of hearing impaired people or a very noisy environment where an ATVS system can be used. It is assumed that the context is known, so keyword spotting is the closest task to the problem.

The performance of the audio-to-visual speech conversion methods is reversed in this task compared to naturalness. The main result here is the dominance of ASR based approaches (Table 3.3), and the insignificance of the difference between informed and uninformed ATVS results (p = 0.43) in this test, which may deserve further investigation. Note that as synchrony is not an issue without voice, the IASR is the best.


Table 3.3: Results of recognition tests, average and standard deviation of success rate in percent. A random pick would give 20%.

  Method              Precision   STD
  IASR                61%         20%
  UASR                57%         22%
  Original motion     53%         18%
  Cartoon             44%         11%
  Direct conversion   36%         27%

Table 3.4: Comparison to the results of Öhman and Salvi [27], an intelligibility test of HMM and rule based systems. Intelligibility of corresponding methods is similar.

  Methods      Prec. (ours)   Prec. ([27])
  IASR/Ideal   61%            64%
  UASR/HMM     57%            54%
  Direct/ANN   36%            34%

As a comparison with [27], where intelligibility is tested similarly, the manually tuned optimal rule based facial parameters are close to our IASR, since there was no recognition error, without voice the time alignment quality is not important, and our TTVS is rule based. Their HMM test is similar to our UASR, because both are without vocabulary, both target a time aligned phoneme string to be converted to facial parameters, and our ASR is HMM based. Their ANN system is very close to our direct conversion except for the training set: theirs is the audio of a standard speech database paired with rule based calculated trajectories as video data, while our system is trained on an actual recording of a professional lip-speaker. However, the results concerning intelligibility are close to each other, see Table 3.4. This is a validation of the results, since the corresponding measurements are close to each other. It is important that [27] tests only intelligibility, and only three methods corresponding to ours, so our measurement is broader.

3.1.5 Conclusion

I presented a comparative study of audio-to-visual speech conversion methods. We have presented a comparison of our direct conversion system with conceptually different conversion solutions. A subset of the results correlates with already published results, validating the approach of the comparison.

We observe a higher importance of synchrony over phoneme precision in an ASR based ATVS system. There are publications on the high impact of correct timing in different aspects [34, 37, 38], but our results show explicitly that more accurate timing achieves a much better subjective evaluation than a more accurate phoneme sequence. Also, we have shown that in the aspect of subjective naturalness evaluation, direct conversion (trained on professional lip-speaker articulation) is a method which produces the highest opinion score, 95.9% of that of an original facial motion recording, with lower computational complexity than ASR based solutions.

For tasks where intelligibility is important (support for hearing impaired people, visual information in noisy environments) modular ATVS is the best approach among those presented. Our mission of aiding hearing impaired people calls upon us to consider using ASR based components. For naturalness (animation, entertainment applications) direct conversion is a good choice. For both aspects UASR gives relatively good but not outstanding results.

3.1.6 Technical details

Marker tracking was done for MPEG-4 FPs 8.8, 8.4, 8.6, 8.1, 8.5, 8.3, 8.7, 8.2, 5.2, 9.2, 9.3, 9.1, 5.1, 2.10 and 2.1. During synthesis, all FAPs (MPEG-4 Facial Animation Parameters) connected to these FPs were used except depth information:

• open_jaw
• lower_t_midlip
• raise_b_midlip
• stretch_l_cornerlip
• stretch_r_cornerlip
• lower_t_lip_lm
• lower_t_lip_rm
• raise_b_lip_lm
• raise_b_lip_rm
• raise_l_cornerlip
• raise_r_cornerlip
• lower_t_midlip_o
• raise_b_midlip_o
• stretch_l_cornerlip_o
• stretch_r_cornerlip_o
• lower_t_lip_lm_o
• lower_t_lip_rm_o
• raise_b_lip_lm_o
• raise_b_lip_rm_o
• raise_l_cornerlip_o
• raise_r_cornerlip_o

The inner lip contour is estimated from the outer markers.

Yellow paint was used to mark the FP locations on the face of the recorded lip-speaker. The video recording is 576i PAL (576x720 pixels, 25 frames/sec, 24 bit/pixel). The audio recording is mono 48 kHz 16 bit, made in a silent room. Further conversions depended on the actual method.

Marker tracking was based on color matching and intensity localization frame to frame, and the location was identified by the region. In overlapping regions the closest location on the previous frame was used to identify the marker. A frame with a neutral face was selected as the reference for FAPU measurement. The marker on the nose is used as a reference to eliminate head motion.

The direct conversion uses a modification of Davide Anguita's Matrix Back Propagation which also enables real-time operation. The neural network used an 11 frame long window on the input side (5 frames to the past and 5 frames to the future), and 4 principal component weights of FAP on the output. Each frame on the input is represented by a 16 band MFCC feature vector. The training set of the system contains standalone words and phonetically balanced sentences.
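A sketch of how the training pairs described above can be assembled, assuming synchronized per-frame 16-band MFCC vectors and 4 FacePCA weights are already available; the random data stand in for a real recording.

```python
import numpy as np

def make_training_pairs(mfcc, face_pca, past=5, future=5):
    """Assemble the training set: the input is an 11 frame window of 16 band MFCC vectors
    (5 past + current + 5 future frames, flattened to 176 values), the target is the
    4 FacePCA weights of the current frame."""
    inputs, targets = [], []
    for t in range(past, len(mfcc) - future):
        inputs.append(mfcc[t - past:t + future + 1].reshape(-1))   # 11 * 16 = 176 values
        targets.append(face_pca[t])                                # 4 principal weights
    return np.array(inputs), np.array(targets)

# Toy example: 1000 synchronized frames of audio features and FacePCA weights.
X, y = make_training_pairs(np.random.randn(1000, 16), np.random.randn(1000, 4))
# X.shape == (990, 176), y.shape == (990, 4); these pairs feed the back-propagation training.
```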

In the ASR the speech signal was converted to a sampling frequency of 16 kHz. MFCC (Mel Frequency Cepstral Coefficient) based feature vectors were computed with delta and delta-delta components (39 dimensions in total). The recognition was performed on a batch of separated samples. Output annotations and the samples were joined, and the synchrony between labels and the signal was checked manually.

The visemes for the linear interpolation method were selected manually for each viseme in Hungarian from the training set of the direct conversion. Visemes and phonemes were assigned by a table. Each segment is a linear interpolation from the actual viseme to the next one. Linear interpolation was calculated in the FAP representation.

The TTVS is a Visual Basic implemented system with a spreadsheet of the timed phonetic data. This spreadsheet was replaced by the ASR output. Neighborhood dependent dominance properties were calculated and viseme ratios were extracted. Linear interpolation, restrictions concerning biological boundaries and median filtering were applied in this order. The output is a Poser data file which is applied to a model. The texture of the model is modified to black skin and differently colored MPEG-4 FP location markers. The animation was rendered in draft mode, with the field of view and resolution of the original recording. Marker tracking was performed as described above, with the exception of the differently colored markers. FAPU values were measured in the rendered pixel space, and FAP values were calculated from FAPU and tracked marker positions.

This was done for both ASR runs, uninformed and informed.

The test material was manually segmented into 2-4 word units. The lengths of the units were around 3 seconds. The segmentation boundaries were listed and the video cut was done automatically with an Avisynth script. We used an MPEG-4 compatible head model renderer plugin for Avisynth, with the model "Alice" of the XFace project. The viewpoint and the field of view were adjusted so that only the mouth was on the screen in frontal view.

During the test the subjects watched the videos full screen and used headphones.


3.2 Thesis

I. I showed that the direct AV mapping method, which is computationally more efficient than modular approaches, outperforms the modular AV mapping in the aspect of naturalness with a specific training set of a professional lip-speaker. [39]

3.2.1 Novelty

This is the first direct AV mapping system trained with data of a professional lip-speaker. The comparison to modular methods is interesting because direct AV mappings trained on low quality articulation can easily be outperformed by modular systems in the aspects of naturalness and intelligibility.

3.2.2 Measurements

Naturalness was measured as subjective similarity to human articulation. The measurement was blind and randomized, the number of test subjects was 58, and our direct AV mapping was not significantly worse than the original visual speech, but the difference between the modular system and the original was significant.

Opinion score averages and deviations showed no significant difference between human articulation and direct conversion, but a significant difference between human and modular mapping based systems.

The measurement was done on a Hungarian database of fluently read speech. The database contains mixed isolated words and sentences.

3.2.3 Limits of validity

Tests were done on a normal speech database, with the test subjects' perception fully focused on videos of good audio and video quality.

3.2.4 Consequences

Using direct conversion for areas where naturalness is most important is encouraged. Using a professional lip-speaker to record the audiovisual database increases the quality to be comparable with the level of human articulation. Other laboratories trained their systems with non-professionals, and those systems were not published due to their poor performance.


Figure 3.5: An example of the importance of correct timing. Frames of the word "Oktober" show timing differences between methods. Note that direct conversion received the best score even though it does not close the lips on the bilabial but closes on the velar, and it has problems with lip rounding.


Chapter 4

Temporal asymmetry

In this chapter I discuss the measurement of the relevant time window for direct AV mapping, which is important for building an audio-to-visual speech conversion system since the temporal window of interest can be determined.

4.1 Method

The fine temporal structure of the relations of acoustic and visual features has been investigated to improve our speech-to-facial conversion system. Mutual information of acoustic and visual features has been calculated with different time shifts. The result has shown that the movement of feature points on the face of professional lip-speakers can precede the changes of the acoustic parameters of the speech signal by as much as 100 ms. Considering this time variation, the quality of speech-to-face-animation conversion can be improved by using the future speech sound in the conversion.

4.1.1 Introduction

Other research projects on conversion of the speech audio signal to facial animation have concentrated on the development of feature extraction methods, database construction and system training [40, 41]. Evaluation and comparison of different systems have also had high importance in the literature. In this chapter I discuss the temporal integration of acoustic features optimal for real-time conversion to facial animation. The critical part of such systems is the building of an optimal statistical model for the calculation of the video features from the audio features. There is currently no known exact relation between the audio feature set and the video feature set; this is still an open question.

The speech signal conveys information elements in a very specific way. Some speech sounds are related rather to a steady state of the articulatory organs, others rather to the transition movements [42]. Our target application is to provide a communication aid to deaf people. Professional lip-speakers use a 5-6 phoneme/s speech rate to adapt the communication to the demand of deaf people, so the steady state phases and the transition phases of speech sounds are longer than in everyday speech style.


The signal features to characterize a sound steady state phase or a transition phase, or even to characterize a co-articulation phenomenon where the neighboring sounds are highly interrelated, need a careful selection of the temporal scope used to characterize the speech and video signal. In our model we selected 5 analysis windows: the window describing the actual frame of speech plus two previous and two succeeding windows, covering a +/-80 ms interval. Such a 5-element sequence of speech parameters can characterize transient sounds and the co-articulations.

We have recognized that at the beginning of words the lip movements start earlier than the sound production. Sometimes the lips start to move to the initial position of the sounds 100 ms earlier. It was the task of the statistical model to handle this phenomenon.

In the refinement phase of our system we have tried to optimize the model by selecting the optimal temporal scope and fitting of audio and video features. The measure of the fitting has been based on the mutual information of audio and video features [43].

The base system uses an adjustable temporal window of the audio speech signal. The neural network can be trained to respond to an array of MFCC windows, using the future and/or past audio data. The conversion can only be as good as the amount of mutual information between the audio and video representations.

Using the trained neural net for calculation of control parameters of the facial animation model

The audio processing unit extracts the audio MFCC feature vectors from the input speech signal. Five frames of MFCC vectors are used as input to the trained neural net. The NN provides FacePCA weight vectors. These are converted into the control parameters of an MPEG-4 standard face animation model. The test of the fitting of audio and video features was based on step-by-step temporal shifting of the feature vectors. The indicator of the matching was mutual information. Low mutual information means that we have a low average chance to estimate the facial parameters from the audio feature set. The time shift value producing the highest mutual information means the maximal average chance to calculate the one kind of known features from the other one.
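
A minimal sketch of the last step, reconstructing FAP control parameters from the predicted FacePCA weights, is shown below; it assumes the PCA mean and basis vectors saved from the training stage, with illustrative names.

    import numpy as np

    def facepca_to_fap(weights, pca_mean, pca_components):
        # weights: (n_components,) neural net output for one frame
        # pca_mean: (n_fap,) mean FAP vector of the training data
        # pca_components: (n_components, n_fap) orthonormal PCA basis
        return pca_mean + weights @ pca_components   # FAPs driving the MPEG-4 model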

Estimation of mutual information needs a computation-intensive algorithm. The calculation is unrealistic on a large database with multidimensional feature vectors. So single MFCPCA and FacePCA parameters were interrelated. Since the single parameters are orthogonal but not independent, they are not additive. For example the FacePCA1 values are not independent from FacePCA2. The mutual information curves even in such complex cases can indicate the interrelations of parameters.

An alternative method is to calculate cross-correlation. We have also tested this method. It needs less computational power, but some of the relations are not indicated, so it is a lower estimate of the theoretical maximum.
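
As a hedged sketch, the lag-dependent cross-correlation of one audio and one video parameter track can be computed as below; variable names are illustrative.

    import numpy as np

    def lagged_correlation(audio_param, video_param, max_lag):
        # normalized cross-correlation for lags from -max_lag to +max_lag samples
        # (positive lag = future audio)
        a = (audio_param - audio_param.mean()) / audio_param.std()
        v = (video_param - video_param.mean()) / video_param.std()
        lags = np.arange(-max_lag, max_lag + 1)
        corr = []
        for lag in lags:
            if lag >= 0:
                corr.append(np.mean(a[lag:] * v[:len(v) - lag]))
            else:
                corr.append(np.mean(a[:lag] * v[-lag:]))
        return lags, np.array(corr)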


Mutual information

MI(X,Y) = \sum_{x \in X} \sum_{y \in Y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)} \qquad (4.1)

Mutual information is high if knowing X helps to find out what Y is, and it is low if X and Y are independent. To use this measurement for the temporal scope, the audio signal will be shifted in time compared to the video. If the time-shifted signal still has high mutual information, it means that this time value should be in the temporal scope. If the time shift is too high, the mutual information between the video and the time-shifted audio will be low due to the relative independence of different phonemes.

Using a and v as audio and video frames:

\forall \Delta t \in [-1\,\mathrm{s}, 1\,\mathrm{s}]: \quad MI(\Delta t) = \sum_{t=1}^{n} P(a_{t+\Delta t}, v_t) \log \frac{P(a_{t+\Delta t}, v_t)}{P(a_{t+\Delta t})\,P(v_t)} \qquad (4.2)

where P(x,y) is estimated by a 2-dimensional histogram convolved with a Gaussian window. The Gaussian window is needed to simulate the continuous space in the histogram in cases where only a few observations are available. Since the audio and video data are multidimensional and MI works with one-dimensional data, all the coefficient vectors were processed and the results are summarized. The mutual information values have been estimated from 200x200 size joint distribution histograms. The histograms have been smoothed by a Gaussian window. The window has a 10-cell radius with a 2.5-cell deviation. The marginal density distribution functions have been calculated from the sum of the joint distribution functions.
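
A minimal sketch of this histogram-based estimator for a single parameter pair, using numpy and scipy, is given below; the bin count and smoothing deviation mirror the values above, while the function and variable names are illustrative.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def mutual_information(a, v, bins=200, sigma=2.5):
        # MI between two 1-D parameter tracks (e.g. one MFCPCA and one FacePCA
        # coefficient) from a Gaussian-smoothed 2-D joint histogram
        joint, _, _ = np.histogram2d(a, v, bins=bins)
        joint = gaussian_filter(joint, sigma=sigma)    # simulate continuous space
        joint /= joint.sum()                           # joint probability P(x, y)
        pa = joint.sum(axis=1, keepdims=True)          # marginal P(x)
        pv = joint.sum(axis=0, keepdims=True)          # marginal P(y)
        mask = joint > 0
        return np.sum(joint[mask] * np.log(joint[mask] / (pa @ pv)[mask]))

    def shifted_mi(a, v, shifts_ms):
        # MI as a function of the audio time shift; both tracks are assumed to be
        # resampled to a 1 ms step, positive shift = future audio
        out = []
        for s in shifts_ms:
            if s >= 0:
                out.append(mutual_information(a[s:], v[:len(v) - s]))
            else:
                out.append(mutual_information(a[:s], v[-s:]))
        return np.array(out)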

MFCPCA and FacePCA measurements

170 seconds of audio and video speech recordings were processed. The time shift has been varied in 1 ms steps. Mel frequency coefficients are calculated for each element. Principal component analysis (PCA) has been applied for an even more compact representation of the audio features, since PCA components can represent the original speech frames with minimal average error at a given subspace dimensionality. In the following, the speech frames are described by such MFCPCA parameters.
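
A hedged sketch of deriving such MFCPCA parameters and their importance rates with scikit-learn could look like the following; the mel-band matrix melspec and the choice of 4 components are assumptions.

    import numpy as np
    from sklearn.decomposition import PCA

    # melspec: (n_frames, n_bands) matrix of mel frequency coefficients
    pca = PCA(n_components=4)
    mfcpca = pca.fit_transform(melspec)       # compact MFCPCA parameters

    alone = pca.explained_variance_ratio_     # importance rate of each component
    together = np.cumsum(alone)               # importance of the first n together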

The MFCPCA parameters are a more readable representation of the frames for human experts than a PCA of MFCC feature vectors.

The MFCPCA parameters have direct relations to the spectrum. The PCA transformation does not consider the sign of the transformed vectors, so the first MFCPCA component shows an energy-like representation, as can be seen in Fig. 4.1. As another example, the second MFCPCA component has positive values in voiced speech frames and negative values in frames of fricative speech elements.

The original video records have a 40 ms frame time, so to have the possibility of 1 ms step-size shifting, the intermediate shifted frame parameters have been calculated by interpolation and low-pass filtering.
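
A minimal sketch of this upsampling from the 40 ms video frame grid to a 1 ms grid, assuming scipy and an illustrative variable name, could be:

    from scipy.signal import resample_poly

    # face_track: one FacePCA coefficient sampled every 40 ms (25 Hz);
    # resample_poly applies the required low-pass filtering internally
    face_track_1ms = resample_poly(face_track, up=40, down=1)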


Figure 4.1: Principal components of MFC feature vectors.


Table 4.1: Importance rate (variance) of the MFCPCA.

    MFCPCA   alone   first n together
    1        77%     77%
    2        10%     87%
    3        5%      93%
    4        2%      95%

Table 4.2: Importance rate (variance) of the FacePCA.

    FacePCA   alone   first n together
    1         90%     90%
    2         6%      96%
    3         2%      98%
    4         1%      99%

Audio and video signals are described by synchronous frames with a 1 ms fine step size. The signals can be shifted relative to each other by these fine steps. The audio and video representations of the speech signal can be interrelated from ∆t = -1000 ms to +1000 ms. Such interrelation can only be investigated at the level of how well a single audio element can be estimated from a shifted video element, and vice versa, on average.

Our calculation is not able to express the value of additional information in the shifted signal compared to the zero-shift value. If it has any additional information, it is not subtracted. So the curves do not indicate the need for extending the time scope for every non-zero value. Rather, the shape of the curve and the shift value of the maximum have a specific meaning.

In the new coordinate system generated by the principal component analysis, the coordinates can be characterized by the importance rate. The importance rate expresses which portion of the variance of the original space is produced in the given direction. The importance rate values in the case of the MFCPCA transformation are shown in Table 4.1.

The importance rate values in the case of the FacePCA transformation are shown in Table 4.2.

Combining the two tables by multiplication of the two vectors, a common importance estimation can be calculated. The values express the contribution of parameter pairs to the whole multidimensional data.
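
For illustration, this pairwise importance can be computed as the outer product of the two importance-rate vectors typed in from Tables 4.1 and 4.2 (a sketch, with illustrative names):

    import numpy as np

    mfcpca_rate = np.array([0.77, 0.10, 0.05, 0.02])   # Table 4.1, "alone" column
    facepca_rate = np.array([0.90, 0.06, 0.02, 0.01])  # Table 4.2, "alone" column

    # pair_importance[i, j]: weight of the (MFCPCA i+1, FacePCA j+1) curve,
    # shown as the darkness of the curves in Figs. 4.2 and 4.3
    pair_importance = np.outer(mfcpca_rate, facepca_rate)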

The really important curves are the combinations of the 1-4 principal components. Their general importance is expressed by the darkness of the curves. Potential systematic errors have been carefully checked. The real synchrony of the audio-video records has been adjusted based on plosive sounds. The noise burst of plosives and the opening position of the lips are reliable characteristics. The check has been repeated at the end of the records as well. The possible synchrony error is below one video frame (40 ms).


Figure 4.2: Shifted 1st FacePCA and MFCPCA mutual information. Positive ∆t means future voice. Darkness shows importance.

4.1.2 Results and conclusions

The mutual information curves were calculated and plotted for every possible PCA parameter pair in the range of -1000 to 1000 ms time shift. Only the most important curves are presented below to show the relation of the components having the highest eigenvalues. The earlier movement of the lips and the mouth had been observed in cases of coarticulation and at the beginning of words. This delay had been considered a specific and negligible effect, and the delay value had only been estimated. Our new experiments produced a general rule with well-defined delay values. Some of the strongest relations of audio and video features are not in the synchronous time frames. The mouth starts to form the articulation in some cases 100 ms earlier, and the audio parameters follow it with such a delay.

The curves of mutual information values are asymmetric and shifted towards positive time shift (delay in sound). This means the acoustic speech signal is a better prediction basis for calculating the previous face and lip positions than the future positions. This fact is in harmony with the mentioned practical observation that the articulation movement precedes the speech production at the beginning of words. The excitation signal comes


Figure 4.3: Shifted 2nd FacePCA and MFCPCA mutual information. Positive ∆t means future voice. Darkness shows importance.
