This chapter is about handling the speaker dependence of the direct learning approach.
5.1 Method
In this chapter a speaker-independent training method is presented for direct ATVS systems. An audiovisual database with multiple voices and only one speaker's video information was created using dynamic time warping: the video information is aligned to the other speakers' voices. The fit is measured with subjective and objective tests. The suitability of implementations on mobile devices is discussed.
5.1.1 Introduction
A direct ATVS needs an audiovisual database which contains audio and video data of a speaking face [44]. The system will be trained on this data, so if there is only one person's voice and face in the database, the system will be speaker dependent. For speaker independence the database should contain several persons' voices, covering as many voice characteristics as possible (see Fig. 5.1). But our task is to compute only one, lip-readable face. Training on multiple speakers' voices and faces results in a face that changes across different voices, and in poor lip readability, because most people lack the necessary articulation talent. We ran a test with deaf persons, and the lip readability of video clips was affected mostly by the recorded person's talent, while the video quality measures such as picture size, resolution or frame rate mattered less. Therefore we asked professional lip-speakers to appear in our database. For speaker independence the system needs voice recordings from several different people, while synthesizing one lip-readable face needs only one person's video data. So in creating a direct ATVS the main problem is to match the audio data of many persons with the video data of one person.
Because using visual speech data from multiple sources would raise the problem of inconsistent articulation, we decided to enhance the database by adding
5. SPEAKER INDEPENDENCE IN DIRECT CONVERSION
Figure 5.1: Overtraining: the network learns training-set-dependent details. The training and test runs were independent.
Figure 5.2: Iterations of alignment. Note that there are features which need more than one iteration of alignment.
audio content without video content, and matching the recorded data wherever the desired visual speech state is the same for several audio samples. In other words, we create training samples expressing "how a professional lip-speaker would visually articulate this" for each audio time window.
I will use a method based on Dynamic Time Warping (DTW) [45] to align the audio modalities of different occurrences of the same sentence. DTW was originally used for ASR purposes in small-vocabulary systems; it is an example of dynamic programming applied to speech audio.
Applying DTW to two audio signals yields an alignment sequence describing how the signals should be warped in time to achieve maximum coherence with each other. DTW has parameters which restrict the possible steps in the time warping; for example, some systems forbid omitting more than one sample in a row. These restrictions guarantee the avoidance of degenerate solutions such as "omit everything, then insert everything". On the other hand, the restricted alignment will be suboptimal.
I have applied restrictive DTW iteratively on the samples. In each iteration the alignment was valid, and the process converged to an acceptable alignment. See Fig. 5.2.
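The alignment step above can be sketched as follows. This is a minimal illustration, not the thesis implementation: frames are assumed to be plain NumPy feature vectors (the actual audio features are not specified here), and the path restriction is realized with a simple Sakoe-Chiba band rather than the per-step omission limits mentioned above.

```python
import numpy as np

def dtw_align(a, b, band=None):
    """Align two feature sequences with DTW.

    a, b: (n, d) and (m, d) arrays of per-frame feature vectors.
    band: optional Sakoe-Chiba half-width restricting |i - j|, one way
          of forbidding degenerate "omit everything" paths.
    Returns the warping path as a list of (i, j) index pairs.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) > band:
                continue
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1],  # diagonal match
                                 cost[i - 1, j],      # skip a frame of a
                                 cost[i, j - 1])      # skip a frame of b
    # Backtrack from the end of both sequences to recover the path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1],
                              cost[i - 1, j],
                              cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Iterating the restricted alignment, as described above, simply means re-running `dtw_align` on the warped sequences until the path stabilizes.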
5.1.2 Speaker independence
The described base system works on well-defined pairs of audio and video data. This pairing is evident if the database is a single-person database. If the video data belongs to a
Figure 5.3: Mean value and standard deviation of scores of test videos.
different person, the task is to fit the audio and the video data together. The text of the database was the same for each person, which allows aligning the audio data between speakers.
The matching described above is represented by index arrays which tell that speaker A at moment i says the same as speaker B at moment j. As long as the audio and video data of each speaker are synchronized, this tells how speaker B holds his mouth when he says the same as speaker A at moment i. With this training data we can use only one person's video information, that of a professional lip-speaker, while at the same time the voice characteristics are covered by multiple speakers' voices.
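The index-array idea can be sketched as follows, assuming a DTW path of (i, j) pairs aligning speaker A's audio (index i) to speaker B's audio (index j), with A's audio and video synchronous; the function name and policy for repeated indices are illustrative.

```python
def video_for_foreign_voice(path, video_frames_a):
    """Pair each audio frame of speaker B with a video frame of A.

    path: DTW warping path of (i, j) index pairs between A's and B's
          audio; video_frames_a: A's video feature-point frames.
    Returns one video frame of A per audio frame of B.
    """
    mapped = {}
    for i, j in path:
        # If several i align to one j, keep the first occurrence.
        mapped.setdefault(j, video_frames_a[i])
    n_b = max(j for _, j in path) + 1
    return [mapped[j] for j in range(n_b)]
```

The output sequence then serves directly as training targets: B's voice paired with A's (the lip-speaker's) articulation.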
Subjective validation
The DTW-given indices were used to create test videos. For audio signals of speakers A, B and C we created video clips from the FP coordinates of speaker A. The videos of the A-A cases were the original frames of the recording; in the cases of B and C, the MPEG-4 FP coordinates of speaker A were mapped onto the voice by DTW. Since the DTW-mapped video clips contain frame doubling, which feels erratic, all clips were smoothed with a window of one neighboring frame on each side. We asked 21 people to tell whether the clips were original recordings or dubbed. They gave scores: 5 for original, 1 for dubbed, and 3 in case of uncertainty.
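The smoothing with one neighboring frame on each side amounts to a 3-point moving average over the feature-point coordinates; a minimal sketch, with the function name being illustrative:

```python
import numpy as np

def smooth_fp(frames):
    """3-point moving average (one neighbor on each side) over
    per-frame feature-point coordinate arrays, softening the frame
    doubling introduced by the DTW mapping. Endpoints are left as-is."""
    frames = np.asarray(frames, dtype=float)
    out = frames.copy()
    out[1:-1] = (frames[:-2] + frames[1:-1] + frames[2:]) / 3.0
    return out
```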
As can be seen in Fig. 5.3, the deviations overlap each other; there
Figure 5.4: Training with speaker A, then A and B, and so on, always testing with speaker E, which is not involved in the training set.
are even modified clips with better scores than some of the originals. The average score of the original videos is 4.2; that of the modified ones is 3.2. We treat this as a good result, since the average score of the modified videos is above the "uncertain" score.
Objective validation
A measurement of speaker independence is testing the system with data which is not in the training set of the neural network. The measurement error is given in pixels. The reason for this unit is the video analysis, where the error of the contour detection is about 1 pixel; this is the upper limit of the practical precision. 40 sentences from 5 speakers were used for this experiment. We used the video information of speaker A as output for each speaker, so in the cases of speakers B, C, D and E the video information was warped onto the voice. We used speaker E as the test reference.
First we tested the original voice and video combination, where the difference from the training was moderate: the average error was 1.5 pixels. When we involved more speakers' data in the training set, the testing error decreased to about 1 pixel, which is our precision limit in the database. See Fig. 5.4.
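The pixel-error figure reported above is, in essence, a mean Euclidean distance between predicted and reference feature-point coordinates; a minimal sketch, assuming (n_frames, n_points, 2) coordinate arrays:

```python
import numpy as np

def mean_pixel_error(predicted, reference):
    """Mean Euclidean distance, in pixels, between predicted and
    reference feature-point coordinates of shape (frames, points, 2)."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.mean(np.linalg.norm(predicted - reference, axis=-1)))
```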
5.1.3 Conclusion
A speaker-independent ATVS is presented. Subjective and objective tests confirm the suitability of DTW for preparing the training data. It is possible to train the system with voice-only data to broaden the coverage of voice characteristics. Most of the calculations of a direct ATVS are cheap enough to implement the system on mobile devices, and the speaker independence adds no extra cost on the client side.
5.2 Thesis
III. I developed a time-warping-based AV synchronizing method to create training samples for direct AV mapping. I showed that the precision of the trained direct AV mapping system increases with each added training sample set, on test material which is not included in the training database. [46]
5.2.1 Novelty
Speaker independence in ATVS is usually handled as an ASR issue, since most ATVS systems are modular, and ASR systems are well prepared for speaker-independence challenges. In this work a speaker-independence enhancement was described which can be used in direct conversion.
5.2.2 Measurements
Subjective and objective measurements were done. The system was driven by an unknown speaker, and the response was tested. In the objective test a neural network was trained on more and more data produced by the described method, and the test error was measured with the unknown speaker. In the subjective test the training data itself was tested: listeners were instructed to tell whether the video was dubbed or original.

5.2.3 Limits of validity
The method is vulnerable to pronunciation mistakes: the audio-only speakers have to say everything just like the original lip-speaker, because if the dynamic programming algorithm loses the synchrony between the samples, serious errors will be included in the resulting training database.
5.2.4 Consequences
This method greatly enhances quality without any run-time penalty. Direct ATVS systems should always use it.