This chapter is about handling the speaker dependence of the direct learning approach.
5.1 Method
In this chapter a speaker-independent training method is presented for direct ATVS systems. An audiovisual database with multiple voices and only one speaker's video information was created using dynamic time warping: the video information is aligned to the other speakers' voices. The fit is measured with subjective and objective tests. The suitability of implementations on mobile devices is discussed.
5.1.1 Introduction
A direct ATVS needs an audiovisual database which contains audio and video data of a speaking face [44]. The system will be trained on this data, so if there is only one person's voice and face in the database, the system will be speaker dependent. For speaker independence the database should contain several persons' voices, covering as many voice characteristics as possible (see Fig. 5.1). But our task is to compute only one, lip-readable face. Training on multiple speakers' voices and faces results in a face that changes across different voices, and in poor lip readability, because most people lack the necessary articulation talent. We ran a test with deaf persons, and the lip readability of video clips was affected mostly by the recorded person's talent, while the video quality measures such as picture size, resolution or frame rate mattered less. Therefore we asked professional lip-speakers to appear in our database. For speaker independence the system needs voice recordings from several different people, while synthesizing one lip-readable face needs only one person's video data. So in creating a direct ATVS the main problem is to match the audio data of many persons with the video data of one person.
Because using visual speech data from multiple sources would raise the problem of inconsistent articulation, we decided to enhance the database by adding
5. SPEAKER INDEPENDENCE IN DIRECT CONVERSION
Figure 5.1: Overtraining: the network learns training-set-dependent details. The training and test runs were independent.
Figure 5.2: Iterations of alignment. Note that there are features which need more than one iteration of alignment.
audio content without video content, and matching the recorded data wherever the desired visual speech state is the same for several audio samples. In other words, we create training samples expressing "how a professional lip-speaker would visually articulate this" for each audio time window.
I will use a method based on Dynamic Time Warping (DTW) [45] to align the audio modalities of different occurrences of the same sentence. DTW was originally used for ASR purposes in small-vocabulary systems; it is an example of dynamic programming applied to speech audio.
Applying DTW to two audio signals yields an alignment sequence describing how the signals should be warped in time to achieve maximum coherence with each other. DTW has parameters which restrict the possible steps in the time warping; for example, some systems forbid omitting more than one sample in a row. These restrictions guarantee the avoidance of degenerate solutions such as "omit everything, then insert everything". On the other hand, the restricted alignment will be suboptimal.
I have applied restrictive DTW iteratively on the samples. In each iteration the alignment was valid, and the process converged to an acceptable alignment. See Fig. 5.2.
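The alignment step above can be sketched as follows. This is a minimal illustration, not the thesis implementation: frames are assumed to be plain NumPy feature vectors (the actual audio features are not specified here), and the path restriction is realized with a simple Sakoe-Chiba band rather than the per-step omission limits mentioned above.

```python
import numpy as np

def dtw_align(a, b, band=None):
    """Align two feature sequences with DTW.

    a, b: (n, d) and (m, d) arrays of per-frame feature vectors.
    band: optional Sakoe-Chiba half-width restricting |i - j|, one way
          of forbidding degenerate "omit everything" paths.
    Returns the warping path as a list of (i, j) index pairs.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) > band:
                continue
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1],  # diagonal match
                                 cost[i - 1, j],      # skip a frame of a
                                 cost[i, j - 1])      # skip a frame of b
    # Backtrack from the end of both sequences to recover the path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1],
                              cost[i - 1, j],
                              cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Iterating the restricted alignment, as described above, simply means re-running `dtw_align` on the warped sequences until the path stabilizes.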
5.1.2 Speaker independence
The described base system works on well-defined pairs of audio and video data. This pairing is evident if the database is a single-person database. If the video data belongs to a
Figure 5.3: Mean value and standard deviation of scores of test videos.
different person, the task is to fit the audio and the video data together. The text of the database was the same for each person, which allows aligning the audio data between speakers.
The matching described above is represented by index arrays which tell that speaker A at moment i says the same as speaker B at moment j. As long as the audio and video data of each speaker are synchronized, this tells how speaker B holds his mouth when he says the same as speaker A at moment i. With this training data we can use only one person's video information, that of a professional lip-speaker, while at the same time the voice characteristics are covered by multiple speakers' voices.
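The index-array idea can be sketched as follows, assuming a DTW path of (i, j) pairs aligning speaker A's audio (index i) to speaker B's audio (index j), with A's audio and video synchronous; the function name and policy for repeated indices are illustrative.

```python
def video_for_foreign_voice(path, video_frames_a):
    """Pair each audio frame of speaker B with a video frame of A.

    path: DTW warping path of (i, j) index pairs between A's and B's
          audio; video_frames_a: A's video feature-point frames.
    Returns one video frame of A per audio frame of B.
    """
    mapped = {}
    for i, j in path:
        # If several i align to one j, keep the first occurrence.
        mapped.setdefault(j, video_frames_a[i])
    n_b = max(j for _, j in path) + 1
    return [mapped[j] for j in range(n_b)]
```

The output sequence then serves directly as training targets: B's voice paired with A's (the lip-speaker's) articulation.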
Subjective validation
The DTW-given indices were used to create test videos. For audio signals of speakers A, B and C we created video clips from the FP coordinates of speaker A. The videos of the A-A cases were the original frames of the recording; in the cases of B and C, the MPEG-4 FP coordinates of speaker A were mapped onto the voice by DTW. Since the DTW-mapped video clips contain frame doubling, which feels erratic, all clips were smoothed with a window of one neighboring frame on each side. We asked 21 people to tell whether the clips were original recordings or dubbed. They gave scores: 5 for original, 1 for dubbed, and 3 in case of uncertainty.
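The smoothing with one neighboring frame on each side amounts to a 3-point moving average over the feature-point coordinates; a minimal sketch, with the function name being illustrative:

```python
import numpy as np

def smooth_fp(frames):
    """3-point moving average (one neighbor on each side) over
    per-frame feature-point coordinate arrays, softening the frame
    doubling introduced by the DTW mapping. Endpoints are left as-is."""
    frames = np.asarray(frames, dtype=float)
    out = frames.copy()
    out[1:-1] = (frames[:-2] + frames[1:-1] + frames[2:]) / 3.0
    return out
```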
As can be seen in Fig. 5.3, the deviations overlap each other; there
Figure 5.4: Training with speaker A, then A and B, and so on, always testing with speaker E, which is not involved in the training set.
are even modified clips with better scores than some of the originals. The average score of the original videos is 4.2; that of the modified ones is 3.2. We treat this as a good result, since the average score of the modified videos is above the "uncertain" score.
Objective validation
A measurement of speaker independence is testing the system with data which is not in the training set of the neural network. The measurement error is given in pixels. The reason for this unit is the video analysis, where the error of the contour detection is about 1 pixel; this is the upper limit of the practical precision. 40 sentences from 5 speakers were used for this experiment. We used the video information of speaker A as output for each speaker, so in the cases of speakers B, C, D and E the video information was warped onto the voice. We used speaker E as the test reference.
First we tested the original voice and video combination, where the difference from the training was moderate: the average error was 1.5 pixels. When we involved more speakers' data in the training set, the testing error decreased to about 1 pixel, which is our precision limit in the database. See Fig. 5.4.
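The pixel-error figure reported above is, in essence, a mean Euclidean distance between predicted and reference feature-point coordinates; a minimal sketch, assuming (n_frames, n_points, 2) coordinate arrays:

```python
import numpy as np

def mean_pixel_error(predicted, reference):
    """Mean Euclidean distance, in pixels, between predicted and
    reference feature-point coordinates of shape (frames, points, 2)."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.mean(np.linalg.norm(predicted - reference, axis=-1)))
```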
5.1.3 Conclusion
A speaker-independent ATVS is presented. Subjective and objective tests confirm the suitability of DTW for preparing the training data. It is possible to train the system with voice-only data to broaden the coverage of voice characteristics. Most of the calculations of a direct ATVS are cheap enough to implement the system on mobile devices, and the speaker independence adds no extra cost on the client side.
5.2 Thesis
III. I developed a time-warping-based AV synchronizing method to create training samples for direct AV mapping. I showed that the precision of the trained direct AV mapping system increases with each added training sample set, on test material which is not included in the training database. [46]
5.2.1 Novelty
Speaker independence in ATVS is usually handled as an ASR issue, since most ATVS systems are modular, and ASR systems are well prepared for speaker-independence challenges. In this work a speaker-independence enhancement was described which can be used in direct conversion.
5.2.2 Measurements
Subjective and objective measurements were done. The system was driven by an unknown speaker, and the response was tested. In the objective test a neural network was trained on more and more data produced by the described method, and the test error was measured with the unknown speaker. In the subjective test the training data itself was tested: listeners were instructed to tell whether the video was dubbed or original.

5.2.3 Limits of validity
The method is vulnerable to pronunciation mistakes: the audio-only speakers have to say everything just like the original lip-speaker, because if the dynamic programming algorithm loses the synchrony between the samples, serious errors will be included in the resulting training database.
5.2.4 Consequences
This method greatly enhances quality without any run-time penalty. Direct ATVS systems should always use it.