
Speaker independence in direct conversion

This chapter is about handling the speaker dependence of the direct learning approach.

5.1 Method

In this chapter a speaker-independent training method is presented for direct ATVS systems. An audiovisual database with multiple voices and only one speaker's video information was created using dynamic time warping. The video information is aligned to more speakers' voices. The fit is measured with subjective and objective tests. The suitability of implementations on mobile devices is discussed.

5.1.1 Introduction

The direct ATVS needs an audiovisual database which contains audio and video data of a speaking face [44]. The system will be trained on this data, so if there is only one person's voice and face in the database, the system will be speaker dependent. For speaker independence the database should contain more persons' voices, covering as many voice characteristics as possible (see Fig. 5.1). But our task is to calculate only one, but lip-readable, face. Training on multiple speakers' voices and faces results in a face that changes with different voices, and in poor lip readability, because most people lack lip-speaking talent. We made a test with deaf persons: the lip readability of video clips is affected mostly by the recorded person's talent, while video quality measures such as picture size, resolution or frame rate affect it less. Therefore we asked professional lip-speakers to appear in our database. For speaker independence the system needs more voice recordings from different people. To synthesize one lip-readable face, only one person's video data is needed. So to create a direct ATVS the main problem is to match the audio data of many persons with the video data of one person.

Because the use of multiple visual speech data from multiple sources would raise the problem of inconsistent articulation, we decided to enhance the database by adding



Figure 5.1: Overtraining: the network learns training-set-dependent details. The train and test runs were independent.


Figure 5.2: Iterations of alignment. Note that there are features which need more than one iteration of alignment.

audio content without video content, and trying to match recorded data where the desired visual speech state is the same for more audio samples. In other words, we create training samples as "how a professional lip-speaker would visually articulate this" for each audio time window.

I will use a method based on Dynamic Time Warping (DTW) [45] to align the audio modalities of different occurrences of the same sentence. DTW was originally used for ASR purposes in small-vocabulary systems. It is an example of dynamic programming for speech audio.

Applying DTW to two audio signals results in a suboptimal alignment sequence describing how the signals should be warped in time to have the maximum coherence with each other. DTW has some parameters which restrict the possible steps in the time warping; for example, in some systems it is forbidden to omit more than one sample in a row. These restrictions guarantee the avoidance of ill solutions, like "omit everything and then insert everything". On the other hand, the alignment will be suboptimal.
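The dynamic-programming core of DTW can be sketched as follows. This is a minimal pure-Python illustration on scalar features; the actual system aligns audio feature vectors and adds the step restrictions described above, so names and details here are illustrative only.

```python
def dtw_path(a, b):
    """Minimal dynamic time warping between two scalar sequences.

    Returns the minimal-cost warping path as a list of index pairs
    (i, j), meaning frame i of `a` is aligned to frame j of `b`.
    This sketch uses the basic step pattern (match, insertion,
    deletion); restricted variants further limit the allowed steps.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j]: minimal accumulated distance aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])             # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # advance in a
                                 cost[i][j - 1],      # advance in b
                                 cost[i - 1][j - 1])  # advance in both
    # Backtrack from the end to recover the aligned index pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path

pairs = dtw_path([1, 2, 3, 3, 4], [1, 2, 3, 4, 4])
```

Note that with this step pattern every frame of both sequences appears in the path, so no frame is ever omitted; the warping only doubles frames where one sequence is locally slower than the other.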

I have used an iterative restrictive DTW application on the samples. In each turn the alignment was valid, and the process converged to an acceptable alignment. See Fig. 5.2.

5.1.2 Speaker independence

The described base system works on well-defined pairs of audio and video data. This is evident if the database is a single-person database. If the video data belongs to a


Figure 5.3: Mean value and standard deviation of scores of test videos.

different person, the task is to fit the audio and the video data together. The text of the database was the same for each person. This allows the aligning of audio data between speakers.

This above-described matching is represented by index arrays which tell that speaker A in moment i says the same as speaker B in moment j. As long as the audio and video data of the speakers are synchronized, this gives the information of how speaker B holds his mouth when he says the same as speaker A says in moment i. With this training data we can have only one person's video information, which is from a professional lip-speaker, and at the same time the voice characteristics can be covered with multiple speakers' voices.
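A hedged sketch of how such index arrays can be applied: given DTW index pairs, speaker A's video frames are remapped onto another speaker's audio timeline. The function name and the last-match tie-break are illustrative assumptions, not the thesis implementation.

```python
def warp_video(video_a, pairs, len_b):
    """Build a video track for speaker B's audio from speaker A's frames.

    `pairs` is a list of DTW index pairs (i, j): moment j of B's audio
    matches moment i of A. For each B frame we pick the last A video
    frame aligned to it, so A frames may be doubled where B is slower.
    """
    frame_for_b = [0] * len_b
    for i, j in pairs:
        frame_for_b[j] = i           # later matches overwrite earlier ones
    return [video_a[i] for i in frame_for_b]

# Toy example: 5 video frames of A and a warping path from a prior DTW run.
video_a = ["f0", "f1", "f2", "f3", "f4"]
pairs = [(0, 0), (1, 1), (2, 2), (3, 2), (4, 3), (4, 4)]
warped = warp_video(video_a, pairs, 5)
```

The result has exactly one video frame per audio frame of speaker B, which is the property the training set needs: B's voice paired with A's (the lip-speaker's) articulation.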

Subjective validation

The DTW-given indices were used to create test videos. For audio signals of speakers A, B and C we created video clips from the FP coordinates of speaker A. The videos of the A-A cases were the original frames of the recording, and in the cases of B and C the MPEG-4 FP coordinates of speaker A were mapped by DTW onto the voice. Since the DTW-mapped video clips contain frame doubling, which feels erratic, all of the clips were smoothed with a window of the neighboring 1-1 frames. We asked 21 people to tell whether the clips are original recordings or dubbed. They had to give scores: 5 for original, 1 for dubbed, 3 in the case of uncertainty.
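The smoothing step above can be sketched as a three-frame moving average (one neighbor on each side). Scalar coordinates are used here for simplicity; the real data are MPEG-4 FP coordinate vectors, so treat this as a minimal assumed illustration.

```python
def smooth(coords):
    """Average each frame's coordinate with its two neighbors.

    A 3-frame window (previous, current, next) masks the frame
    doubling introduced by DTW warping; at the ends the window
    shrinks to the available frames.
    """
    out = []
    for t in range(len(coords)):
        lo, hi = max(0, t - 1), min(len(coords), t + 2)
        window = coords[lo:hi]
        out.append(sum(window) / len(window))
    return out

# A doubled-frame jump 0 -> 2 becomes a gradual transition.
smoothed = smooth([0.0, 0.0, 2.0, 2.0, 2.0])
```

The abrupt jump between the doubled frames is spread over the adjacent frames, which is exactly what makes the warped clips feel less erratic.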

As can be seen in Fig. 5.3, the deviations overlap each other; there


Figure 5.4: Training with speaker A, then A and B, and so on, always testing with speaker E, which is not involved in the training set.

are even better-scored modified clips than some of the originals. The average score of the original videos is 4.2, that of the modified ones is 3.2. We treat this as a good result, since the average score of the modified videos is above the "uncertain" score.

Objective validation

A measurement of speaker independence is testing the system with data which is not in the training set of the neural network. The unit of the measurement error is the pixel. The reason for this is the video analysis, where the error of the contour detection is about 1 pixel. This is the upper limit of the practical precision. 40 sentences of 5 speakers were used for this experiment. We used the video information of speaker A as output for each speaker, so in the cases of speakers B, C, D and E the video information is warped onto the voice. We used speaker E as test reference.

First, we tested the original voice and video combination, where the difference from the training was moderate; the average error was 1.5 pixels. When we involved more speakers' data in the training set, the testing error decreased to about 1 pixel, which is our precision limit in the database. See Fig. 5.4.

5.1.3 Conclusion

A speaker-independent ATVS is presented. Subjective and objective tests confirm the suitability of DTW for training data preparation. It is possible to train

the system with voice-only recordings to broaden the coverage of voice characteristics. Most of the calculations of direct ATVS are cheap enough to implement the system on mobile devices. The speaker independence induces no extra expense on the client side.

5.2 Thesis

III. I developed a time-warping-based AV synchronizing method to create training samples for direct AV mapping. I showed that the precision of the trained direct AV mapping system increases with each added training sample set, on test material which is not included in the training database. [46]

5.2.1 Novelty

Speaker independence in ATVS is usually handled as an ASR issue, since most ATVS systems are modular ATVS, and ASR systems are well prepared for speaker independence challenges. In this work a speaker independence enhancement was described which can be used in direct conversion.

5.2.2 Measurements

Subjective and objective measurements were done. The system was driven by an unknown speaker, and the response was tested. In the objective test a neural network was trained on more and more data produced by the described method, and the test error was measured with the unknown speaker. In the subjective test the training data itself was tested: listeners were instructed to tell if the video is dubbed or original.

5.2.3 Limits of validity

The method is vulnerable to pronunciation mistakes; the audio-only speakers have to say everything just like the original lip-speaker, because if the dynamic programming algorithm loses the synchrony between the samples, serious errors will be included in the resulting training database.

5.2.4 Consequences

This is a method which greatly enhances quality without any run-time penalties. Direct ATVS systems should always use it.


Chapter6