2.5.1 Direct ATVS and co-articulation
The most common form of language is personal talk, which is an audiovisual speech process. Our research focuses on the relation between the audio and the visual parts of talking, in order to build a system that converts a voice signal into face animation.
Co-articulation is the phenomenon of transient phases in the speech process. In the audio modality, co-articulation is the effect of the neighboring phonemes on the actual state of speech in a short time window, shorter than a phoneme duration. In speech synthesis, there is a strong demand to create natural transients between the clean states of speech. In visual speech synthesis this issue is also important. In the visual speech process there are visemes even if the synthesizer does not explicitly use this concept. Visual co-articulation can be defined as a system of influences between visemes in time. Because of biological limitations, visual co-articulation is slower than audio co-articulation, but similar in other ways: neighboring visemes can have an effect on each other, some visemes are stronger than others, and most cases can be described or approximated as an interpolation of neighboring visemes.
Let me call a system a visual speech transient model if it generates intermediate states of visual speech units, such as visemes. One example of a visual speech transient model is the co-articulation concept strictly adopted to visemes, the visual co-articulation, since viseme string processing has to decide how interpolation should take place between the visemes. Another example is the adaptation of direct conversion to longer time windows, in order to include more than one phoneme on the audio modality. In this case the transients depend on acoustical properties. In modular ATVS systems, the transients are coded in rules depending on viseme string neighborhoods.
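The simplest transient model of the first kind can be sketched as a plain linear interpolation between two viseme targets. This is only an illustration under stated assumptions: the two-parameter viseme vectors (lip opening, lip width), their values, and the function name `interpolate_visemes` are hypothetical and are not taken from the base system, which may use other parameters and rule-based weighting.

```python
def interpolate_visemes(start, end, steps):
    """Return `steps` intermediate mouth shapes between two viseme targets.

    `start` and `end` are equal-length lists of face parameters.
    The end points themselves are not included in the result.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if len(start) != len(end):
        raise ValueError("viseme vectors must have the same length")
    frames = []
    for i in range(1, steps + 1):
        t = i / (steps + 1)  # interpolation weight, strictly between 0 and 1
        frames.append([(1 - t) * a + t * b for a, b in zip(start, end)])
    return frames


# Hypothetical visemes described by [lip opening, lip width].
closed_viseme = [0.0, 0.5]
open_viseme = [1.0, 0.3]
transients = interpolate_visemes(closed_viseme, open_viseme, 3)
```

A rule-based modular system would replace the uniform weight `t` with weights chosen from the viseme string neighborhood, for example to let a stronger viseme dominate the transition.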
Utilization
Training a direct ATVS needs audio-video data pairs. Since plenty of speech audio databases exist but only a few audiovisual ones, building a direct ATVS means building a multimodal database first. A discrete ATVS is a modular system, so it is possible to use existing speech databases to train voice recognition, and separately train the animation part on phoneme pairs or trigraphs [1]. Therefore a direct ATVS needs a special database, but the system will handle energy and rhythm naturally, whereas a discrete ATVS has to reassemble the phonemes into a fluid co-articulation chain of viseme interpolations. Let us use the term “temporal scope” for the overall time of a co-articulation phenomenon, meaning that the state of the mouth depends on this time interval of the speech signal. In a direct ATVS the calculation of a frame is based on this audio signal interval. In a discrete ATVS the visemes and the phonemes are synchronized and interpolation is applied between them, as is popular in text-to-visual-speech systems [22]. Figure 2.3 shows this difference.
Figure 2.3: Temporal scope of discrete (interpolating) and direct ATVS
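The temporal scope of a direct ATVS can be illustrated by collecting the audio feature frames that surround one video frame. The sketch below makes several assumptions: the function name `audio_window`, the window sizes `before` and `after`, and the one-dimensional toy features are all hypothetical, and the trained mapping from the window to face parameters is not shown.

```python
def audio_window(audio_frames, center, before, after):
    """Concatenate the audio feature frames in the temporal scope of one video frame.

    `audio_frames` is a list of feature vectors (lists of floats), one per
    time step. The window covers `before` frames preceding `center` and
    `after` frames following it. Indices outside the recording are clamped
    to the first or last frame.
    """
    if not audio_frames:
        raise ValueError("audio_frames must not be empty")
    last = len(audio_frames) - 1
    window = []
    for i in range(center - before, center + after + 1):
        j = min(max(i, 0), last)  # clamp to the valid index range
        window.extend(audio_frames[j])
    return window


# Toy one-dimensional audio features for four time steps.
features = [[0.1], [0.2], [0.3], [0.4]]
scope = audio_window(features, center=0, before=1, after=2)
```

Keeping `before` and `after` as separate parameters allows the window to be placed asymmetrically around the video frame.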
Asymmetry
As the mutual information estimation showed, any given state of the video data stream can be calculated fairly well from a definable relative time window of the speech signal. This model predicts that the transient phase of visible speech can be calculated in the same way as the steady phase, as Figure 2.3 shows.
This model gives a prediction about temporal asymmetries in the multimodal speech process. This asymmetry can be explained by mental predictivity in the motion of the facial muscles to fluently form the next phoneme. Details will follow in chapter “Temporal asymmetry”.
Speaker independence
Since direct conversion is usually an approximation trained on a given set of audio and video states, it suffers from heavy dependence on the database. As I detailed before, a good direct ATVS needs a good lip-speaker to share visual data with the system. Talented lip-speakers are rare, and most experienced lip-speakers are women. This means that a single recording of one lip-speaker gives a speaker-dependent system, and collecting more professional lip-speakers would also result in a gender-dependent system, since the statistics of the data would be heavily biased; it is very difficult to collect enough male lip-speakers for the system.
Even if we had plenty of professional lip-speakers, there is a question about mixing the video data. People articulate differently. It is not guaranteed that a mixture of good articulations results in even an acceptable articulation. The safest solution is to choose one of the lip-speakers as a guaranteed high-quality articulation, and to try to use his/her performance with multiple voices.
I will give a solution for this problem in chapter “Speaker independence in direct conversion”.
2.5.2 Evaluation
The base system was published as a standalone system, and was measured with subjective opinion scores and intelligibility tests with deaf persons. In my chapter about the comparison of AV mappings, I will position the direct ATVS among the others used in the world.
Oddly, there are quite few publications on direct ATVS. This is strange, because the system is one of the simplest designs. Let me tell a personal experience from the EUSIPCO conference in Florence. A young researcher was interested in our Johnnie demo. He was from ATR, Japan, and he praised our system. As I explained the workflow of the system, at each stage he said “We did the same”. Even the number of PCA coefficients was the same. At the end, he said that their system produced significantly worse results than ours; it was not even published because it was flawed. We agreed then that the most important difference is the lip-speaker’s professionality.
His work can be read in Japanese [23] in the annual report of the institute; incidentally, in the same year we published our results in Hungarian [24, 25, 26].
Another example of direct conversion publicity is a comparison study of an unpublished direct system [27] used as an internal baseline.
Because of this underpublicity, it is important to position direct ATVS among the more popular modular ATVS systems, since most visual speech synthesis research groups also try to implement a direct ATVS, but their efforts fail because of the quality of the database. At first glance this may seem to be bad news for our research, but the novelty of our system is still unharmed, since the work in ATR was identical only in the technical details, and a training system’s technology itself, without the database, is not a whole system. Our base system is new because of the new training data and the finding of the need for a professional lip-speaker. Again, I would like to emphasize that the difference between our base system and the one developed in ATR is not “only” the database but the training strategy, which is one of the most important and fundamental parts of any learning system.
This new and successful learning strategy makes our base system novel, but in this thesis I focus on my own results, not those of the research group. Johnnie Talker is a contribution of the group, and the following extensions and measurements are contributions of the author of this thesis.
In the next chapter I will show how the base system, with the essential database of the professional lip-speaker, can be ranked among the widely used ATVS systems.