2 .5 Extend ingd irectconvers ion - Audiotovisualspeechconversion GergelyFeldhoﬀer

2.5.1 Direct ATVSandco-articulation

The mostcommonformofthelanguageisthepersonaltalkwhichisanaudiovisual speechprocess. Ourresearchisfocusedontherelationoftheaudioandthevisualpart oftalkingtobuildasystemconvertingvoicesignalintofaceanimation.

Co-articulationisthephenomenaoftransientphasesinspeechprocess.Inaudio modality,co-articulationistheeﬀectoftheneighboringphonemestotheactualstate ofspeechinashortwindowofthetime,shorterthanaphonemeduration.Inspeech synthesis,thereisastrongdemandtocreatenaturaltransientsbetweenthecleanstates ofspeech.Invisualspeechsynthesisthisissueisalsoimportant.Inthevisualspeech processtherearevisemesevenifthesynthesizerdoesnotexplicitlyusethisconcept. Visualco-articulationcanbedeﬁnedasasystemofinﬂuencesbetweenvisemesin time. Becauseofbiologicallimitations,visualco-articulationisslowerthanaud ioco-articulation,butsimilarinother ways: neighboringvisemescanhaveeﬀectoneach

18 2. MOTIVATION AND THE BASESYSTEM

other,therearestrongervisemesthanothers,and mostofthecasescanbedescribed orapproximatedasaninterpolationofneighboringvisemes.

Let mecallasystemvisualspeechtransient model,ifitgenerates mediatestatesof visualspeechunits,suchasvisemes. Anexampleofvisualspeechtransientmodelisthe strictlyadoptedco-articulationconceptonvisemes,thevisualco-articulation,sincethe visemestringprocessinghastodecidehowinterpolationshouldtakeplacebetweenthe visemes. Anotherexampleofvisualspeechtransient modelsisthedirectconversion’s adaptationtolongertime windowsinordertoinclude morethanonephonemeon theaudio modality.Inthiscasethetransientsdependonacousticalproperties.In modularATVSsystems,thetransientsarecodedinrulesdependingonvisemestring neighborhoods.

Utilization

Trainingadirect ATVSneedsaudio-videodatapairs. Sinceplentyofspeechaudio databasesexistbutonlyafewaudiovisualones,buildingadirectATVSmeansbuilding amultimodaldatabaseﬁrst. AdiscreteATVSisamodularsystem,itispossibletouse existingspeechdatabasestotrainvoicerecognition,andseparatelytraintheanimation partonphonemepairsortrigraphs[1]. ThereforedirectATVSneedsaspecialdatabase, butthesystemwillhandleenergyandrhythmnaturally,meanwhileadiscreteATVShas toreassemblethephonemesintoaﬂuidcoarticulationchainofvisemeinterpolations. Letusetheterm“temportalscope”fortheoveralltimeofacoarticulationphenomena, whichmeansthatthestateofthemouthisdependingonthistimeintervalofthespeech signal.IndirectATVSthecalculationofaframeisbasedonthisaudiosignalinterval. IndiscreteATVSthevisemesandthephonemesaresynchronizedandinterpolationis appliedbetweenthem,asitispopularintexttovisualspeechsystems[22].Figure2.3 showsthisdiﬀerence.

Figure2.3: Temporalscopeofdiscrete(interpolating)anddirectATVS

Asymmetry

As mutualinformationestimationresultedanygivenstateofthevideodatastream canbecalculatedfairlyonadeﬁnablerelativetimewindowofthespeechsignal. This modelpredictsthatthetransientphaseofthevisiblespeechcanbecalculatedinthe samewayasinthesteadyphaseasFigure2.3shows.

Thismodelgivesapredictionabouttemporalasymmetriesinthemultimodalspeech process. Thisasymmetrycanbeexplainedwith mentalpredictivityinthe motionof

2.5.2Evaluation 19

thefacial musclestoﬂuentlyformthenextphoneme. Detailswillfollowinchapter“Temporalasymmetry”. Speakerindependence

Sincethedirectconversionisusuallyanapproximationtrainedonagivensetofaudio andvideostates,itsuﬀersheavydependenceonthedatabase. AsIdetailedbefore,for agooddirectATVSagoodlip-speakerneededtosharevisualdatawiththesystem. Talentedlip-speakersarerare,and mostoftheexperiencedlip-speakersare women. Thismeansthatasinglerecordingofonelip-speakergivesnotonlyaspeakerdependent system,butcollecting moreprofessionallip-speakerswouldresultagenderdependent system,sincethestatisticsofthedatawouldheavilybiased,oritisverydiﬃcultto collectenough malelip-speakertothesystem.

Evenifwewouldhaveplentyofprofessionallip-speakers,thereisaquestionabout the mixingthevideodata. Peoplearticulatediﬀerently.Itisnotguaranteedthata mixtureofgoodarticulationsresultevenanacceptablearticulation. The mostsafe solutionistochooseoneofthelip-speakersasaguaranteedhighqualityarticulation, andtryingtousehis/herperformancewith multiplevoices.

Iwillgiveasolutionforthisprobleminchapter“Speakerindependenceindirect conversion”.

2.5.2 Evaluation

Thebasesystemwaspublishedasastandalonesystem,andwasmeasuredw ithsubjec-tiveopinionscoresandintelligibilitytestswithdeafpersons.In mychapteraboutthe comparisonofAV mappings,IwillpositionthedirectATVSamongtheothersusedin theworld.

Oddlytherearequitefewpublicationsondirect ATVS. Thisisstrange,because thesystemisoneofthe mostsimpledesigns.Let metellapersonalexperiencefroma conferenceofEUSIPCO,Florence. AyoungresearcherwasinterestedinourJohnnie demo. He wasfrom ATR,Japan,andhepraisedoursystem. AsIexplainedthe workﬂowofthesystem,oneachstagehesaid“Wedidthesame”. Eventhenumber of PCAcoeﬃcients wasthesame. Attheend,hesaidthattheirsystemproduce signiﬁcantlyworseresultsthanours,itwasnotevenpublishedbecauseitwasﬂawed. Weagreedthenthatthe mostimportantdiﬀerenceisthelip-speaker’sprofessionality.

Hisworkcanbereadinjapanese[23]intheannualreportoftheinstitute,bythe wayinthesameyearwepublishedourresultsinHungarian[24,25,26]

Anotherexampleofdirectconversionpublicityisacomparisonstudyo fanunpub-lisheddirectsystem[27]usedasinternalbaseline.

Becauseofthisunderpublicity,itisimportanttopositiondirectATVSamongthe morepopularmodularATVSsystems,sincemostofthevisualspeechsynthesisresearch groupsalsotrytoimplementadirectATVS,buttheireﬀortsfailbecauseofthequality ofthedatabase. Attheﬁrstglancethis mayseemasbadnewsforourresearch,but thenoveltyofoursystemisstillunharmedsincetheworkinATRwasidenticalonlyin

thetechnicaldetails,andatrainingsystem’stechnologyitself,withoutthedatabaseis notawholesystem. Ourbasesystemisnewbecauseofthenewtrainingdata,andthe ﬁndingoftheneedoftheprofessionallip-speaker. Again,Iwouldliketoemphasizethat diﬀerencebetweenourbasesystemandtheonedevelopedin ATRisnot“only”the databasebutthetrainingstrategy,whichisoneofthemostimportantandfundamental partofanylearningsystem.

Thisnewandsuccessfulllearningstrategy makesourbasesystemnovel,butin thisthesisIfocustheresultsof myown,nottheresearchgroup. Johnnie Talker isacontributionofthegroup,andthefollowingextensionsand measurementsare contributionoftheauthorofthisthesis.

InthenextchapterIwillshowhowthebasesystemwiththeessentialdatabaseof theprofessionallip-speakercanberankedamongthewidelyusedATVSsystems.

Chapter3

In document Audiotovisualspeechconversion GergelyFeldhoﬀer (Pldal 23-27)