
2.5 Extending direct conversion

2.5.1 Direct ATVS and co-articulation

The most common form of language is personal talk, which is an audiovisual speech process. Our research is focused on the relation between the audio and the visual parts of talking, in order to build a system that converts a voice signal into face animation.

Co-articulation is the phenomenon of transient phases in the speech process. In the audio modality, co-articulation is the effect of the neighboring phonemes on the actual state of speech within a short time window, shorter than a phoneme duration. In speech synthesis, there is a strong demand to create natural transients between the clean states of speech. In visual speech synthesis this issue is also important. In the visual speech process there are visemes even if the synthesizer does not explicitly use this concept. Visual co-articulation can be defined as a system of influences between visemes in time. Because of biological limitations, visual co-articulation is slower than audio co-articulation, but similar in other ways: neighboring visemes can have an effect on each

18 2. MOTIVATION AND THE BASE SYSTEM

other, some visemes are stronger than others, and most cases can be described or approximated as an interpolation of neighboring visemes.

Let me call a system a visual speech transient model if it generates intermediate states between visual speech units, such as visemes. An example of a visual speech transient model is the strictly adopted co-articulation concept on visemes, the visual co-articulation, since the viseme string processing has to decide how interpolation should take place between the visemes. Another example of visual speech transient models is the adaptation of direct conversion to longer time windows in order to include more than one phoneme in the audio modality. In this case the transients depend on acoustical properties. In modular ATVS systems, the transients are coded in rules depending on viseme string neighborhoods.
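The interpolating kind of transient model described above can be illustrated with a minimal sketch. It assumes that a viseme is represented as a plain vector of mouth parameters and that the transient is linear; both are simplifications chosen for illustration, not the rules of any particular modular ATVS. The function name and the two-parameter visemes are hypothetical.

```python
from typing import List, Sequence


def interpolate_visemes(a: Sequence[float], b: Sequence[float],
                        steps: int) -> List[List[float]]:
    """Return `steps` intermediate frames strictly between viseme targets a and b.

    Assumption: linear interpolation in parameter space; real systems may use
    dominance functions or hand-written rules instead.
    """
    if len(a) != len(b):
        raise ValueError("viseme vectors must have the same length")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    frames = []
    for k in range(1, steps + 1):
        t = k / (steps + 1)  # 0 < t < 1, so the targets themselves are excluded
        frames.append([(1.0 - t) * x + t * y for x, y in zip(a, b)])
    return frames


# Hypothetical two-parameter visemes: (mouth opening, lip width)
closed = [0.0, 0.5]
opened = [1.0, 0.25]
transient = interpolate_visemes(closed, opened, 3)
```

Here the transient depends only on the two neighboring visemes and on time, which is exactly why such a model cannot react to acoustic properties of the speech signal.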

Utilization

Training a direct ATVS needs audio-video data pairs. Since plenty of speech audio databases exist but only a few audiovisual ones, building a direct ATVS means building a multimodal database first. A discrete ATVS is a modular system, so it is possible to use existing speech databases to train voice recognition, and to train the animation part separately on phoneme pairs or trigraphs [1]. Therefore a direct ATVS needs a special database, but the system will handle energy and rhythm naturally, whereas a discrete ATVS has to reassemble the phonemes into a fluid co-articulation chain of viseme interpolations. Let us use the term “temporal scope” for the overall duration of a co-articulation phenomenon, meaning that the state of the mouth depends on this time interval of the speech signal. In a direct ATVS the calculation of a frame is based on this audio signal interval. In a discrete ATVS the visemes and the phonemes are synchronized and interpolation is applied between them, as is popular in text-to-visual-speech systems [22]. Figure 2.3 shows this difference.

Figure 2.3: Temporal scope of discrete (interpolating) and direct ATVS
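The temporal scope of a direct ATVS can be sketched as a window of audio feature frames gathered around each video frame. The sketch below assumes that audio features and video frames share the same frame rate and that the window edges are handled by repeating the first or last frame; the function name and the window sizes are hypothetical and are not taken from the base system.

```python
from typing import List, Sequence


def context_window(audio_frames: Sequence[Sequence[float]], center: int,
                   past: int, future: int) -> List[float]:
    """Concatenate the audio feature frames in [center - past, center + future].

    Indices outside the recording are clamped to the first or last frame
    (an assumed edge policy). The result would be the input from which a
    direct converter computes one video frame.
    """
    if not audio_frames:
        raise ValueError("audio_frames must not be empty")
    last = len(audio_frames) - 1
    stacked: List[float] = []
    for i in range(center - past, center + future + 1):
        j = min(max(i, 0), last)  # clamp to the valid index range
        stacked.extend(audio_frames[j])
    return stacked


# Toy one-dimensional features, one per frame
audio = [[0.0], [1.0], [2.0], [3.0], [4.0]]
middle = context_window(audio, center=2, past=1, future=2)
start = context_window(audio, center=0, past=2, future=1)
```

Keeping `past` and `future` as separate parameters leaves room for a window that is not symmetric around the current frame.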

Asymmetry

As the mutual information estimation showed, any given state of the video data stream can be calculated fairly well from a definable relative time window of the speech signal. This model predicts that the transient phase of visible speech can be calculated in the same way as the steady phase, as Figure 2.3 shows.
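How a relative time window might be located with mutual information can be sketched as follows. This is only an illustration under strong assumptions: both streams are already quantized into discrete symbols, the estimator is the simple plug-in (histogram) estimate, and the toy data are constructed so that the video stream runs one frame ahead of the audio. None of this reproduces the estimation procedure used in the thesis.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def mi_at_lag(audio: Sequence[Hashable], video: Sequence[Hashable],
              lag: int) -> float:
    """MI between audio[t + lag] and video[t]; a positive lag pairs each
    video frame with a later audio frame."""
    pairs = [(audio[t + lag], video[t])
             for t in range(len(video)) if 0 <= t + lag < len(audio)]
    return mutual_information([a for a, _ in pairs], [v for _, v in pairs])


# Toy symbols: the video state at frame t equals the audio state at t + 1.
audio = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
video = audio[1:] + [0]
best_lag = max(range(-2, 3), key=lambda lag: mi_at_lag(audio, video, lag))
```

On this constructed example the estimate peaks at a lag of one frame; on real recordings the shape of the curve over the lags is what would indicate the usable window.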

This model gives a prediction about temporal asymmetries in the multimodal speech process. This asymmetry can be explained by mental predictivity in the motion of


the facial muscles to fluently form the next phoneme. Details will follow in the chapter “Temporal asymmetry”.

Speaker independence

Since direct conversion is usually an approximation trained on a given set of audio and video states, it suffers from heavy dependence on the database. As I detailed before, a good direct ATVS needs a good lip-speaker to provide the visual data for the system. Talented lip-speakers are rare, and most of the experienced lip-speakers are women. This means that a single recording of one lip-speaker gives a speaker-dependent system, and even collecting more professional lip-speakers would result in a gender-dependent system, since the statistics of the data would be heavily biased; it is very difficult to collect enough male lip-speakers for the system.

Even if we had plenty of professional lip-speakers, there would be a question about mixing the video data. People articulate differently. It is not guaranteed that a mixture of good articulations results in even an acceptable articulation. The safest solution is to choose one of the lip-speakers as a guaranteed high-quality articulation, and to try to use his or her performance with multiple voices.

I will give a solution for this problem in the chapter “Speaker independence in direct conversion”.

2.5.2 Evaluation

The base system was published as a standalone system, and was measured with subjective opinion scores and intelligibility tests with deaf persons. In my chapter about the comparison of AV mappings, I will position the direct ATVS among the others used in the world.

Oddly, there are quite few publications on direct ATVS. This is strange, because the system is one of the simplest designs. Let me share a personal experience from the EUSIPCO conference in Florence. A young researcher was interested in our Johnnie demo. He was from ATR, Japan, and he praised our system. As I explained the workflow of the system, at each stage he said “We did the same”. Even the number of PCA coefficients was the same. At the end, he said that their system produced significantly worse results than ours, and that it was not even published because it was flawed. We agreed then that the most important difference is the professionalism of the lip-speaker.

His work can be read in Japanese [23] in the annual report of the institute; incidentally, in the same year we published our results in Hungarian [24, 25, 26].

Another example of the publicity of direct conversion is a comparison study of an unpublished direct system [27] used as an internal baseline.

Because of this underpublicity, it is important to position direct ATVS among the more popular modular ATVS systems, since most of the visual speech synthesis research groups also try to implement a direct ATVS, but their efforts fail because of the quality of the database. At first glance this may seem like bad news for our research, but the novelty of our system is still unharmed, since the work at ATR was identical only in the technical details, and a training system's technology by itself, without the database, is not a whole system. Our base system is new because of the new training data and the finding that a professional lip-speaker is needed. Again, I would like to emphasize that the difference between our base system and the one developed at ATR is not “only” the database but the training strategy, which is one of the most important and fundamental parts of any learning system.

This new and successful learning strategy makes our base system novel, but in this thesis I focus on my own results, not those of the research group. Johnnie Talker is a contribution of the group, and the following extensions and measurements are contributions of the author of this thesis.

In the next chapter I will show how the base system, with the essential database of the professional lip-speaker, can be ranked among the widely used ATVS systems.


Chapter 3