
Adaptive, Hybrid Feature Selection (AHFS)

Zsolt János Viharos (Dr., Ph.D.) a,b,*, Krisztián Balázs Kis a, Ádám Fodor a,c, Máté István Büki a

a Institute for Computer Science and Control (SZTAKI), Centre of Excellence in Production Informatics and Control, Eötvös Loránd Research Network (ELKH), Research Laboratory on Engineering and Management Intelligence, Intelligent Processes Research Group, H-1111 Budapest, Kende u. 13–17., Hungary
b John von Neumann University, Faculty of Economics, Department of International Economics, Kecskemét, H-1117, Izsáki u. 10., Hungary
c Eötvös Loránd University, Department of Software Technology and Methodology, Budapest, H-1117, Pázmány P. sétány 1/C., Hungary

Article info

Article history: Received 11 April 2020; Revised 18 September 2020; Accepted 3 March 2021; Available online 11 March 2021

MSC: 00-01; 99-00

Keywords: Adaptive; Hybrid Feature Selection (AHFS); Combination of methods; Statistics; Information theory; Exhaustive evaluation

Abstract

This paper deals with the problem of integrating the most suitable feature selection methods for a given problem in order to achieve the best feature order. A new, adaptive and hybrid feature selection approach is proposed, which combines and utilizes multiple individual methods in order to achieve a more generalized solution. Various state-of-the-art feature selection methods are presented in detail with examples of their applications, and an exhaustive evaluation is conducted to measure and compare their performance with the proposed approach. Results prove that while the individual feature selection methods may perform with high variety on the test cases, the combined algorithm steadily provides a noticeably better solution.

© 2021 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Most real-world modeling problems can be formulated as the estimation of some numerical value or the classification of a given number of samples. These problems, more often than not, are very complex and can be defined by tens, hundreds or even thousands of variables. High-dimensional data is hard to handle by soft computing methods, but in most cases many variables are highly redundant, noisy and/or irrelevant for solving a specific estimation or classification task. The need arises to reduce the dimension of a problem by selecting only the relevant features for a given assignment, with relatively fast methods compared to the highly accurate but costly estimation models. Feature Selection (FS) methods do exactly that. Some of them are able to perform calculations faster than others but at the price of losing accuracy, while others work the other way around. Generally, there is no universal solution; some methods are more suitable for a given assignment than others.

We propose a new, adaptive and hybrid feature selection approach, which combines and utilizes multiple individual methods in order to achieve a more generalized solution. It minimizes the shortcomings of each incorporated algorithm by dynamically choosing the most suitable one for a given assignment and dataset.

* Corresponding author. E-mail address: viharos.zsolt@sztaki.hu (Z.J. Viharos).

The paper contains seven sections. After the introduction, the second section presents different feature selection methods and applications from varying branches. The third section describes the proposed combined algorithm, which is followed by the evaluation part considering comprehensive tests on artificial datasets in relation to data distributions, noise level and outliers. In the next part the proposed method is evaluated through many benchmarking datasets according to modelling error and calculation time demands, and subsequently it is compared with the very recent, state-of-the-art feature selection methods. Finally, the conclusion, acknowledgment and reference sections close the paper.

2. Methods and applications of feature selection

Feature selection methods reduce the dimension of a problem by selecting or creating a representative subset of features for a given assignment, making it easier for more (computationally) demanding algorithms to manage. They can be categorized as filter, wrapper and embedded methods [1]. Miao and Niu gave a structure as a simple tree for positioning various feature selection techniques.



Key decision points are the label information (supervised, semi-supervised or unsupervised techniques), while filter, wrapper and embedded methods relate to the type of the search strategy. Filter methods rely on general characteristics of the training data to select the most relevant subset of variables without involving any learning algorithm. Wrapper methods use a learning algorithm to detect possible interactions between variables and then select the best subset of features. Finally, embedded methods try to combine the advantages of both previous methods: in embedded methods the learning and the feature selection parts cannot be separated; furthermore, feature selection and evaluation proceed simultaneously. Srivastava et al. published a review paper on feature selection methodologies and their applications in which they showed a comparison table of filter, wrapper and embedded methods [2]. It is described that wrapper methods are usually superior to the other two techniques; however, they have the highest computational requirement. On the other hand, the learning-method independence of filter methods provides a more general solution in this respect. Muñoz-Romero et al. proposed a particularly sophisticated method called Informative Variable Identifier (IVI) with the aim of adding interpretability to feature selection. They also compared the performance of their algorithm using various state-of-the-art measures of dependencies among variables [3]. A novel, improved unsupervised feature selection algorithm is presented by Shang et al. using a matrix decomposition method as its core technique.

The kernelization of the local discriminant model was introduced for the necessary handling of non-linearity, together with the proposal of a new measurement norm [4]. Improvements in unsupervised feature selection were also proposed by Zhang et al. using guided subspace learning [5].

Feature selection methods are very useful in many fields which work with high-dimensional data. In applications of computer vision and image processing, the features describe artifacts of the digital image. Zini et al. addressed the problem of structured feature selection in a multi-class classification setting by proposing a new formulation of the Group LASSO method [6]. Their method outperformed the state-of-the-art approaches when tested on two benchmark datasets. Jiang and Li used the Minimal Redundancy Maximal Relevance (mRMR) method for classification of cotton foreign matter using hyper-spectral imaging [7]. They showed the generality of the method by building different learning models on the selected features and achieving similar estimation accuracy.

Feature selection methods are also applied for monitoring and fault diagnosis, where several sensor measurements and other variables describe the actual state of the system. Zhang et al. used feature extraction and selection for multi-sensor-based real-time quality monitoring in arc welding [8]. In another example the authors used feature selection for high-dimensional machinery fault diagnosis [9]. After selecting the relevant features with a hybrid solution combining multiple methods, they used Radial Basis Function networks for classification and evaluated their approach on two data cases, showing that the method is useful for revealing fault-related frequency features.

Time series forecasting is used in many domains such as weather, energy consumption, financial planning, etc. Many variables have to be taken into consideration for an accurate forecast, due to the high complexity of the task. Naturally, feature selection methods are very useful in this field, too. Carta et al. compared feature selection methods using ANNs in Measure-Correlate-Predict (MCP) wind speed methods [10]. Their results showed that the Multi-Layer Perceptron-based wrapper method performed better in every test case, while the filter approach, the Correlation Feature Selection method, proved to be more efficient in terms of computational load and resulted in more model interpretability. Kong et al. performed wind speed prediction using reduced support vector machines with feature selection [11]. They successfully selected a smaller, relevant subset of the features, which they used for training a reduced support vector machine, and proved its effectiveness through detailed analysis and simulations. Finally, Ircio et al. used an adaptation of existing nonparametric mutual information estimators based on the k-nearest neighbor for selecting a subset of multiple time series [12]. In their experiments they managed to strongly reduce the number of time series while keeping or increasing the classification accuracy.

3. The basics of the novel adaptive, hybrid feature selection (AHFS) method

There are two broad approaches to measuring the dependency between two random variables. First of all, correlation-based measures were examined to determine the adequacy of features.

The linear correlation coefficient is one of the best-known measures [13]. Other measures in this category are variations of this approach. There are several benefits of this measure: it helps to remove features that are nearly zero-correlated with the target class, and in addition it can help to reduce the redundancy between the selected features.

The other common approach for determining dependencies between variables is the use of information-theory based concepts.

In general, a feature is good if it is relevant to the target class but not redundant to any of the other selected relevant features.

Prominent relevancy and reduced redundancy can be achieved with information theoretic ranking criteria, like entropy as a measure of the uncertainty of a random variable. The conditional entropy is the amount of uncertainty left in X when a variable Y is introduced, so it is less than or equal to the entropy of X. The amount by which the entropy of X decreases reflects the additional information about X provided by Y and is called information gain, which is also used as a synonym of mutual information, the expected value of the information gain.
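For reference, a brief recap of these quantities in the usual notation (X and Y denote random variables and p their probability mass functions; the notation is added here for clarity and is not part of the original text):

```latex
H(X)        = -\sum_{x} p(x)\,\log p(x)                                 % entropy
H(X \mid Y) = -\sum_{y} p(y) \sum_{x} p(x \mid y)\,\log p(x \mid y)     % conditional entropy
IG(X;Y)     = H(X) - H(X \mid Y) = I(X;Y)                               % information gain = mutual information
I(X;Y)      = \sum_{x}\sum_{y} p(x,y)\,\log \frac{p(x,y)}{p(x)\,p(y)}
SU(X,Y)     = 2\,\frac{I(X;Y)}{H(X) + H(Y)}                             % symmetrical uncertainty
```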

Various measures inherited from information theory, like Shannon entropy, joint and conditional entropy, mutual information and symmetrical uncertainty, are frequently applied in recent feature selection solutions and applications like FCBF [14] and mRMR [15]. The information-theory based measures can capture higher-level correlations as well, and they have become ever more popular in recent decades [16]. The following algorithms are the most popular for selecting the appropriate set of features:

– Forward feature selection (FFSA) [17]
– Modified mutual information-based feature selection (MMIFS) [18,19]
– Linear correlation-based feature selection (LCFS) [13]
– Fast Correlation Based Filter algorithms (FCBF and FCBF#) [20,21]
– Minimal Redundancy Maximal Relevance (mRMR and mRMR#) [15,22]
– Joint Mutual Information Maximisation and Normalized Joint Mutual Information Maximisation (JMIM and NJMIM) [23,24]
– Euclidean distance-based selection [25]

3.1. Motivation and concept

The previous paragraphs highlight various methods and real-life applications of Feature Selection (FS) algorithms, proving their importance. Various FS methods exist today, and the review of their basic concepts, theories, applications and benchmarking comparisons mirrors the following:

In general, no "best of" or "best practice" feature selection solution is given. The applications and scientific publications mirror that the "best solutions" may largely differ from case to case. FS is applied frequently, but the most promising solutions are dataset and/or application field specific.


Wang et al. already formulated that there exists a relationship between the performance of a feature selection algorithm and the characteristics of data sets [26].

According to the definition of feature selection, this Data Mining (DM) tool is applied before a training algorithm. On rare occasions, having the results of feature selection (i.e., the list of selected features), a theoretical model is built (e.g. using equations); furthermore, feature selection is seldom combined/integrated with learning [27]. Consequently, in general it is a preliminary step of a training algorithm that is based on the same dataset.

Feature selection is applied for two main reasons:

– Reducing the number of features significantly decreases the computational requirements of the subsequent modelling solution and also the sensor requirements of the given applications.

– The elimination of irrelevant information (irrelevant features) results in more accurate models (it decreases the noise).

It has to be mentioned that in other respects, e.g. in technical applications like failure detection and forecasting, as components of diagnostics and supervision, it is valuable to have redundant information to some degree, e.g. to detect non-conform situations [28] or to overcome the failures of sensors or any other data processing components. It is advantageous also when incomplete data arise [29].

Roughly speaking, feature selection algorithms consist of three main calculation components:

– The first part is the calculation of one (or rarely more) metrics using the given dataset, to obtain some numerical evaluation of the individual variables or variable sets. Typically, two aspects are evaluated: the redundancies among variables and the correlations between the individual variables and the (later on) estimated (target) variable.

– The second part is a search algorithm that applies the above measures/metrics to determine the order and/or the selected set of features. There are many such solutions; one could differentiate among them by whether they are greedy (like Sequential Feature Selection, SFS) or somehow optimized algorithms.

– The third part is the applied modeling methodology. In filter methods it is fully separated from the feature selection component; however, in wrapper and embedded methods it is integrated on different levels.

The scientific literature represents a great variety of combinations of the applied measures and search algorithms.

Not all of the available/possible feature selection metrics (measures) were introduced above, only the most frequently applied ones (because of their superiority over other methods), meaning that a new solution is valuable if it can exploit the advantages of any of the given or later introduced FS techniques. Based on this idea, a hybrid solution is proposed in this paper which combines the given, available (supervised) feature selection techniques that have their own specific, but fixed, feature evaluation measures/metrics. The proposed methodology can be extended easily with any novel FS methods and metrics, so any alternative feature selection technique can become a part of the proposed solution, too. This is one aspect of the hybridity. Since there is no general, "best" FS algorithm, which indicates that no "best" feature measure/metric is given, and moreover because the main aim of FS is to support the building of a learning model after its usage, the proposed solution utilizes the applied learning model in its mathematical algorithm. (The authors' implementation uses the Multi-Layer Perceptron (MLP) model; however, aside from some special techniques, any other learning model can be applied here.) This is the second aspect of the hybridity.

Since many combinations of applied measures and search solutions are published in the state-of-the-art literature, the proposed algorithm applies the simple but frequently used Sequential Forward Selection (SFS) technique. It is a feed-forward calculation method that extends the already selected set of features with only one additional variable in each of its iteration steps. In this respect it is a greedy algorithm; naturally, later on the proposed solution can be improved by substituting it with a more suitable and generalized search solution, but this additional research direction is beyond the scope of the current paper. According to the above state-of-the-art statements, there is no general, unique feature selection methodology; it always has to be adapted to the given application and dataset. Consequently, a novel solution is needed that is adaptive. The main aim is to ensure this through the adaptivity of the proposed algorithm at each iteration step of the applied Sequential Forward Search iteration. At a certain SFS step a set of already selected variables (features) is given, and the method evaluates each possible extension of this dataset by one additional variable.

In the state-of-the-art solutions (only) one feature selection measure is used for selecting an additional variable for the final extension, so the state-of-the-art algorithms iterate in the space of the variables. Adaptivity of the proposed algorithm is realized in such a way that at an individual step of the feature selection algorithm it iterates not only in the space of the variables but in the space of the available feature selection techniques, too. This is the core idea of the paper. Since there is no "best of" feature selection measure/metric, choosing one from these candidates can only be realized by using the applied learning method as an independent evaluation tool. This means that each such candidate model configuration is built up, and the model having the smallest estimation error specifies which variable has to be selected as the current extension. These motivations and concepts led to the novel Adaptive, Hybrid Feature Selection (AHFS) algorithm, introduced, described and evaluated in this paper.

3.2. Search strategy: Sequential forward selection

Filter, wrapper and embedded methods apply a search strategy for the selection of the feature order. Guyon and Elisseeff presented various feature selection possibilities for finding a subset of variables for building a good predictor [1]. Such methods include variable ranking, elimination of redundant variables, variable subset selection, nested subset methods, forward selection and backward elimination. Consequently, there are various possible solutions for the applied search methodology. Sequential Forward Selection (SFS) was selected; however, almost all of the other possible techniques can be applied inside the proposed AHFS methodology. The SFS algorithm is a bottom-up search procedure which starts from an empty set and gradually adds features to the current feature set. The decision of selection depends on a predetermined evaluation function. At each iteration, the feature to be included in the feature set S is selected from among the remaining available features in the feature set F, so that the extended set S produces the maximum value of the criterion function used [25].

The simplicity and speed of SFS offer a compromise among high-dimensional search spaces, slow evaluation and the execution time demanded by the algorithm.
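As an illustration, a minimal sketch of the SFS loop described above is given below (Python; the criterion function J and the feature pool F are generic placeholders introduced for illustration, not part of the original paper):

```python
def sequential_forward_selection(F, k, J):
    """Greedy SFS: F is the list of available feature indices, k is the target
    subset size and J(S) is a criterion to be maximized for a candidate set S."""
    S = []                                            # already selected features
    while len(S) < min(k, len(F)):
        remaining = [f for f in F if f not in S]
        # add the single feature whose inclusion maximizes the criterion
        best = max(remaining, key=lambda f: J(S + [f]))
        S.append(best)
    return S
```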

3.3. Modeling method: Multi-Layer Perceptron (MLP)

Artificial Neural Networks (ANNs) are powerful computational models which can be utilized for solving complex estimation and classification problems due to their robustness and capability of high-level generalization. An ANN implements the functionality of biological neural networks by building up a network of autonomous computational units (neurons) and connecting them via weighted links, as defined by the first pioneers W.S. McCulloch and W. Pitts [30]. One of the most popular and widespread ANN model types is the MLP [31]; another concept which became popular and widely used nowadays is deep learning [32].

Fig. 1. Operation graph of the proposed algorithm depicting two steps.

The MLP model is used in the algorithm as part of the evaluation, meaning that in every step where a feature is evaluated, an MLP model is trained to determine how much error the feature yields compared to the others.
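A minimal sketch of this model-based evaluation step is given below, assuming scikit-learn's MLPRegressor as the MLP implementation and a simple hold-out error as the score; both choices are illustrative assumptions, not details taken from the paper:

```python
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def mlp_error(X, y, feature_idx, random_state=0):
    """Train an MLP on the given feature subset and return its hold-out error."""
    X_sub = X[:, feature_idx]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_sub, y, test_size=0.3, random_state=random_state)
    model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                         random_state=random_state)
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te))
```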

3.4. The proposed adaptive, hybrid feature selection (AHFS) algorithm

In the previous sections the described feature selection methods were presented as efficient algorithms which use the interdependence of features together with their dependence on the given class. However, their performance is perceptibly diverse across different datasets and assignments. The proposed algorithm combines the different objective functions/measures by giving every algorithm the chance to suggest the next possible best feature in the sequential forward selection algorithm. However, this concept is independent of the currently applied search method.

In order to find the best feature subset/order, the algorithm uses a forward selection technique with a sequential search strategy. The termination criterion is reaching a subset of size k, prescribed before the run, e.g. by a human expert. The proposed algorithm uses predetermined feature selection methods. There are two important phases:

– The first one is the feature selection phase, which iterates through the predetermined feature selection algorithms, invoking them with the already selected feature set S. Every algorithm proposes one feature as the next potential best feature to be selected according to its own measure. The candidate features are collected into the feature set Sp in every iteration.

– Having candidates for an additional feature inherited from the different feature selection methods, the aim of the second phase is to select one of them, so in this phase the algorithm iterates through the promising additional feature space. The selection is based on the (highest) accuracy of the trained artificial neural network models.

Fig. 2. Model error as a function of the number of selected features related to all the three chosen distributions.


Fig. 3. Model error as a function of the number of selected features related to all the three noise levels.

Features from set S and one candidate feature from set Sp are used as inputs, with the target variable as a single output. In this way, as many models are generated as there are unique features in Sp; since Sp is not a multiset, this quantity is simply |Sp|. In order to evaluate the performance of the candidate features, the MLP models (described in Section 3.3) with the generated configurations are trained. In every iteration only one feature is selected and added to the output set S, until the expected number of features is reached.

The operation of the proposed algorithm on the Housing dataset [33] is demonstrated in Fig. 1. The nodes of the graph contain the indices of features as numbers with frames in different color variations. The meaning of the colors varies depending on the state of the examined variable. Features with black frames form the already selected feature set S in the current state, while the colorful ones (red, green) are the candidate features in set Sp. A green frame marks the best feature, which has the smallest estimation error compared to the other possible variables in red frames. Directed edges represent the transitions between different states of feature subsets, which are, in this case, the different predetermined feature selection methods.

In addition to the best subset of features, the proposed algorithm also returns, per selected variable, the collection of the feature selection methods that were best in the given search step. These lists contain the names of the algorithms which have selected the examined variable.
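To summarize the two phases, a compact sketch of the AHFS loop is given below. It reuses the hypothetical mlp_error helper sketched in Section 3.3; the fs_methods argument stands for the incorporated feature selection algorithms, each assumed to propose one next feature given the already selected set. This interface is illustrative, not the authors' implementation:

```python
def ahfs(X, y, fs_methods, k):
    """Adaptive, Hybrid Feature Selection (illustrative sketch).

    fs_methods: list of callables, each mapping (X, y, S) to a proposed
    feature index. Returns the ordered list of selected features S and,
    per step, the names of the methods whose proposal won the MLP-based
    evaluation."""
    S, winners = [], []
    while len(S) < min(k, X.shape[1]):
        # Phase 1: every incorporated FS method proposes one candidate feature
        candidates = {}                  # candidate feature -> proposing methods
        for method in fs_methods:
            f = method(X, y, S)
            if f is not None and f not in S:
                candidates.setdefault(f, []).append(method.__name__)
        if not candidates:
            break
        # Phase 2: train one MLP per unique candidate and keep the best one
        errors = {f: mlp_error(X, y, S + [f]) for f in candidates}
        best = min(errors, key=errors.get)
        S.append(best)
        winners.append(candidates[best])
    return S, winners
```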

4. Evaluation

Four different directions of tests were applied to comprehensively evaluate and compare the capabilities of the proposed AHFS algorithm:

– Known linear functions were used for test data generation, and additional effects were applied to this data by varying its distributions, noise and outlier levels in order to analyze their effects on the performance of AHFS.

– Known non-linear functions were used for test data generation with a Gaussian distribution and added middle-level noise and outliers.

– Various tests were performed on real, well-known benchmark datasets from the UCI Machine Learning repository and on some other real datasets collected by the authors during various industrial applications.

– Finally, AHFS was compared on a small and also on a big dataset to some other very recent, highest-level state-of-the-art feature selection algorithms.

4.1. Evaluation on artificial datasets with known effects

4.1.1. Experiments on linear functions with varying distribution, noise level and outliers

In order to provide more insight into the advantages and challenges of the proposed data analysis approach, an evaluation on artificial datasets with known and tunable effects was carried out. During these tests, the performance of the proposed algorithm (AHFS) and its components was compared with respect to the evolution of the model error while the number of selected features was increased.

At first, data from three different source distributions were generated: 1000-by-15 feature matrices consisting of entries whose values came from the specified distributions, independently of the values of the other cells. 10 out of the 15 features were used to create the target column by applying a linear function with random coefficients to the chosen features (let us call these the informative features in the next paragraphs), while the remaining 5 features are independent, random values. The chosen distributions were Uniform on the interval [0, 1], Poisson with lambda parameter 50, and the standard normal distribution.

Fig. 4. Model error as a function of the number of selected features related to all the three outlier levels.
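A sketch of this data generation procedure is given below (NumPy; the concrete seed and coefficient range are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features, n_informative = 1000, 15, 10

# the three source distributions used for the entries of the feature matrix
distributions = {
    "uniform": lambda size: rng.uniform(0.0, 1.0, size),
    "poisson": lambda size: rng.poisson(50, size).astype(float),
    "normal":  lambda size: rng.standard_normal(size),
}

datasets = {}
for name, draw in distributions.items():
    X = draw((n_samples, n_features))
    coeffs = rng.uniform(-1.0, 1.0, n_informative)   # random linear coefficients
    # the target depends only on the first 10 (informative) features;
    # the remaining 5 features stay independent, random values
    y = X[:, :n_informative] @ coeffs
    datasets[name] = (X, y)
```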

The results can be seen in Fig. 2, showing the model errors according to the number of selected (input) features. It is obvious that AHFS immediately finds every informative feature, resulting in a strictly monotonically decreasing model error which saturates after all the informative features have been found. The majority of the other algorithms cannot accomplish this, or even if they can, they result in much higher model errors. This difference is especially outstanding in the case of the Poisson distribution. It can be seen that the proposed algorithm (AHFS) outperforms all of the individual (incorporated) methods in all data distribution cases.

In the second case, the effect of adding noise to the data to different extents was tested. Hereafter, for the sake of simplicity, the uniform distribution was used during the data generation process, and after creating the target it was normalized linearly to the [0, 1] range.

The noise distribution was Gaussian with three different values of its standard deviation (σ): 0.5, 0.5/3 and 0.5/6 (0.5 is half of the total data range). Noise was applied to all the entries of the feature matrix and to the target column as well. The results are represented in Fig. 3. According to the graphs, AHFS is much better than any of the individual (component) algorithms with respect to the model error. This is outstandingly justified by the graphs belonging to the σ = 0.5/3 and σ = 0.5/6 (smaller noise) cases. It can be seen that AHFS performs significantly better than the other algorithms on all noise levels.

Interesting insights can also be acquired by exposing the uniformly generated, linear dataset to various outlier levels. The results are shown in Fig. 4.

Different portions of the entries of these matrices were randomly selected, and uniformly distributed noise generating the outliers (with the parameters specified below) was added to the chosen values. Three different outlier levels were labeled as "small", "mid(dle)" and "high". The parameters corresponding to these labels are, respectively: 2%, ±[1, 1.5]; 5%, ±[1.5, 2]; 10%, ±[2, 3]. The percentages correspond to the chosen portions of the matrices, and the "±" sign refers to the process of generating the outliers: after choosing a value to be corrupted, a value uniformly drawn from the specified range (given in brackets), taken with either sign, was added to the randomly chosen cell of the matrix. The original data were pre-normalized to the [0, 1] range beforehand, so, e.g., +[1, 1.5] moves the given data point clearly outside the original, complete data range; the outlier level is therefore significant. The benefit of AHFS is smaller here than in the previous experiments, but its advantage in handling outliers is still obvious.
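The outlier injection described above can be sketched as follows (a minimal NumPy interpretation of the stated parameters, assuming data pre-normalized to [0, 1]):

```python
import numpy as np

def add_outliers(X, fraction, low, high, seed=0):
    """Corrupt a random `fraction` of the entries of X by adding a value drawn
    uniformly from [low, high], taken with a random sign (the "±" above)."""
    rng = np.random.default_rng(seed)
    X = X.copy()
    n_outliers = int(fraction * X.size)
    idx = rng.choice(X.size, size=n_outliers, replace=False)
    magnitude = rng.uniform(low, high, size=n_outliers)
    sign = rng.choice([-1.0, 1.0], size=n_outliers)
    X.ravel()[idx] += sign * magnitude
    return X

# the three outlier levels used in the experiments
small = lambda X: add_outliers(X, 0.02, 1.0, 1.5)
mid   = lambda X: add_outliers(X, 0.05, 1.5, 2.0)
high  = lambda X: add_outliers(X, 0.10, 2.0, 3.0)
```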

4.1.2. Experiments on non-linear dependencies

Finally, the non-linear benchmark regression problem called Friedman1 was used. The results are illustrated in Fig. 5. In this problem there are 5 informative features which are in a highly non-linear relationship with the target, and 3 more features which have nothing to do with the target were added. Two different scenarios were distinguished: one in which the dataset was not exposed to any disturbing effect (and had a uniform distribution), and another in which Gaussian noise (with σ = 0.5/3) and outliers (at the "mid" level) were applied at the same time.
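The Friedman1 problem is available, for example, in scikit-learn; a sketch of generating the clean variant with 5 informative plus 3 irrelevant features (assuming the make_friedman1 helper) could look as follows:

```python
from sklearn.datasets import make_friedman1

# 5 informative features (used by the non-linear Friedman function)
# plus 3 additional, irrelevant features
X, y = make_friedman1(n_samples=1000, n_features=8, noise=0.0, random_state=0)
```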


Fig. 5. Model error as a function of the number of selected features related to the two variants of the Friedman1 regression problem.

As seen in the experiments, the proposed algorithm outperforms the other ones in the case of highly non-linear dependencies as well.

4.2. Evaluation on benchmarking datasets

There are various applied test cases; the UCI Machine Learning repository is probably the one most frequently used by the Artificial Intelligence/Machine Learning community [33]. Iris, as the most frequently used, well-known dataset, was selected as one of the first classification test assignments. For regression-oriented tests another dataset named Housing (also known as Boston) was selected as a public benchmark case. In order to have a noise-free dataset together with noisy data from the same domain, Calculated cutting and Measured cutting are applied, respectively. Wind turbine monitoring and Situation detection during special machining are test cases for higher data amounts and for high data complexity/variety, with various levels of noisy data, incorporating partial redundancy, high non-linearity, outliers, non-uniform data distribution and many other disturbing, industrial real-life effects. The datasets can be described as follows. Calculated cutting: this test case consists of different machine setting, cutting tool, monitoring and product quality parameters of metal cutting (turning). Measured cutting: this test case also consists of cutting parameters, but in contrast to the previous one, it was collected from real measurements. Iris: this is one of the most popular public benchmark databases [33], which is used frequently for comparing different Machine Learning methods. Housing: another well-known regression benchmark dataset, "Housing", was also selected [33], which describes the properties of some suburban real estate of Boston. Wind turbine SCADA: this test case contains measured values that describe the detailed state of wind turbines and were gathered from the wind turbine SCADA system. Wind turbine monitoring: this test case utilizes the high-frequency data of the many monitoring sensors inside a working wind turbine. Situation detection during special machining: this dataset was built from high-frequency monitoring parameters of a special machining process over multiple experiments. Table 1 summarizes their sizes and types.

4.2.1. Measured cutting

This simple and small test case is inherited from real metal (steel) cutting measurements (performed by the corresponding author). There are three assignments regarding this dataset.

In Fig. 6 each line of the diagrams describes the performance of a single feature selection method, where the x axis shows the number of features used for building the model and the y axis shows the related model error. For the sake of simplicity, only the first 50 features were selected for each assignment, if the dataset contains at least that many features. If there are fewer than 50 features, then every variable is selected.

Fig. 7 shows the operation of the proposed algorithm. This table is a simplified version of the operation graph shown in Fig. 1 in Section 3.4. Each row corresponds to a feature selection method, while the columns denote the number of selected features, which is equivalent to the step number the algorithm is currently at. The gray cells, with a plus sign inside them, mark the methods that chose the best feature in that step, while the white cells, with a minus sign, mean that those methods chose other features that proved to be less desirable during the model-based evaluation. The operation diagram is useful for seeing how the algorithm works and how diverse the selection of the individual methods is; moreover, it shows that in many cases multiple feature selection methods choose the same, best parameter, while others select worse ones in terms of model accuracy.

The first assignment is the estimation of the surface roughness Ra. The proposed algorithm performed slightly better than the individual methods (Fig. 6 (a)). In many cases the same features were selected by each feature selection method, as shown in Fig. 7 (a). The main component of the cutting force was estimated in the second assignment. The selection of the first two features was diverse, but the AHFS selected the variable with the minimal error value, proposed by LCFS, too (Fig. 6 (b)). The last assignment for this dataset is the cutting temperature estimation. Fig. 6 (c) shows that the greedy evaluation affects the selection slightly at step 3, but before the third and after the fourth variable the AHFS does not suffer from this inconvenience.

4.2.2. Situation detection during special machining

The situation detection during special machining test case consists of high-frequency monitoring parameters of a special machining process. The dataset contains over 10,000 samples and more than 1100 variables. The assignment is a binary classification problem: estimating whether the situation has happened or not.

According to Fig. 8, the ORIG method performed extremely poorly compared to the other methods. Furthermore, the proposed AHFS algorithm provides the best performance, but it is worth noting that the MMIFS method holds the same model errors as the AHFS for the first 10 selected features, although it loses accuracy compared to the AHFS above 10 features. This is mirrored in the selection table in Fig. 9, too.


Table 1. Dataset properties.

Dataset                                          Samples    Dimension   Category
Calculated cutting                               450        9           Industry
Measured cutting                                 120        7           Industry
Iris                                             150        7           Biology
Housing                                          506        14          Socioeconomy
Wind turbine SCADA                               839        57          Energy
Wind turbine monitoring - Oil pressure           1857       998         Energy
Wind turbine monitoring - Oil temperature        1886       866         Energy
Wind turbine monitoring - Bearing temperature    1876       951         Energy
Situation detection during special machining     > 10,000   > 1,100     Industry

Fig. 6. Performance on the Measured cutting dataset: (a) Ra est., (b) Fc est. and (c) T est.

Fig. 7. Algorithms selected on the Measured cutting dataset: (a) Ra est., (b) Fc est. and (c) T est.

It is also worth noticing that this assignment's selection matrix is sparse, meaning that in most of the steps only one feature selection method provided the best feature.

It has to be emphasized that the proposed algorithm was already successfully applied in the industrial collaborations of the authors before writing this paper. The models prepared according to the results of the paper are already incorporated in the control system of the related machines and work well (detecting difficult-to-identify situations) on the shop floor, in daily production.

4.3. Comprehensive evaluation

The previous sections detailed the performance results of the individual dataset tests; at the end of this section a comprehensive evaluation is given to summarize the results of the individual dataset tests and to highlight the main characteristics of the proposed algorithm. To give an overall performance evaluation, an average of the individual feature selection algorithms' performances has been calculated and presented. The modeling error values, measured at the evaluation of the individual estimation assignments, had to be normalized using a linear scale into the range from 0 to 1 in order to compare them to each other and to enable the calculation of an average performance.
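A sketch of this normalization step (min-max scaling of each assignment's model-error curve into [0, 1]; an interpretation of the text, not the authors' exact procedure):

```python
import numpy as np

def normalized_errors(errors):
    """Linearly scale one assignment's model-error curve into the [0, 1] range
    so that curves from different assignments become comparable and can be
    averaged."""
    e = np.asarray(errors, dtype=float)
    span = e.max() - e.min()
    return (e - e.min()) / span if span > 0 else np.zeros_like(e)
```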

As detailed before, some datasets have fewer than 50 features; in their case, naturally, the algorithms selected fewer features and, consequently, fewer models were built for evaluation. As a consequence, the average performance was calculated from fewer and fewer assignments as the number of selected features grew. For example, the average performance of the first 2 selected features is calculated from all of the assignments, because every dataset has at least 2 features to select.


Fig. 8. Performance on Situation detection during special machining dataset: SM situation est.

Fig. 9. Algorithms selected on Situation detection during special machining dataset: SM situation est.

In the case of the first 30 features, however, only the assignments of the wind turbine SCADA/monitoring and the situation detection datasets were used, because all the others have fewer than 30 variables in total.

Fig. 10 shows the average of the individual feature selection performances measured on each assignment. Each line describes the performance of a single, individual feature selection method, where the x axis shows the number of features used for building the model and the y axis shows the normalized model error.

The overall performance diagrams in Fig. 10 show that the proposed algorithm significantly outperforms the individual methods in general. Furthermore, the biggest difference reveals itself in the case of the first 4 to 25 selected features, which means that the new method finds the most important features earlier than the other methods.

Table 2 mirrors the overall performance (model accuracy) improvement of the proposed algorithm compared to the individual, state-of-the-art methods.

Table 2. Comparison of the proposed algorithm with the individual methods.

           FIRST 5    FIRST 10   FIRST 30   FIRST 50
MEAN       183%       218%       236%       304%
MIN        139%       152%       153%       167%
MMIFS      147%       156%       156%       183%

The rows of the table denote the base of the comparison: in the first row (MEAN), the proposed algorithm is compared to the mean performance of the individual methods; in the second row (MIN), it is compared to the minimum of the individual method errors for each feature number; and in the third row (MMIFS), it is compared to the best individual method, i.e. the one which provided the best model accuracy in general. The columns indicate that the model errors of the FIRST N features were averaged in the comparison. So, for example, the first column of the first row shows that the average model error of the individual methods is 210% of the model error of the proposed AHFS algorithm when considering the first 5 features. In other words, the average model error, calculated on the first 5 features selected by the proposed algorithm, is approximately half of the average model error of the individual methods.

The percentages vary from 139% to 304%, with an overall average of 183%.


Fig. 10. Average performance of the proposed algorithm and the individual methods.

This indicates that the AHFS nearly doubles the accuracy (resulting in around half the value for the related modeling error) compared to the individual methods, making it a superior feature selection algorithm. It is worth mentioning that the individual algorithms are well-known and widely applied, best-in-class methods.

4.3.1. Calculation time performance

The previous paragraphs evaluated the proposed AHFS algorithm from the modelling accuracy point of view; the current one evaluates its computational requirements. Dongmei Mo and Zhihui Lai proposed a robust jointly sparse regression for effective feature selection [34]. Their experimental results indicate that the proposed method can outperform the locality-based methods (LPP, OLPP, FOLPP), the joint sparsity learning methods (JELSR) and the L1-norm based methods (SLE, LPP-L1) with strong robustness. However, the reported computational complexity is much higher than that of the traditional methods, such as PCA, LPP and RR; in this respect this scientific result is similar to the AHFS results. Moreover, Zhao et al. presented a classified nested equivalence class (CNEC)-based approach to calculate the information-entropy-based significance for feature selection using rough set theory [35]. Their goal was to increase the computational efficiency of the information-entropy-based significance measure, and they showed that their solution indeed greatly improved the runtime of multiple feature selection algorithms that use rough set theory, which can be beneficial in further developments of the AHFS method.

Fig. 11 represents the running time demands on the datasets in a comprehensive way, considering the number of samples (X, horizontal axis) and features (Y, vertical axis) displayed on a logarithmic scale, while the sizes of the circles represent the calculation time requirement. For simplicity, when multiple assignments are defined on the same dataset, the average computation time of the assignments is calculated and plotted. It is worth noticing that in the case of Wind turbine SCADA, the proportion of training time is significantly higher than in the case of the other datasets. This characteristic of the proposed method is the consequence of the following: (a) if the dataset has low dimension and a smaller number of samples, then the computation time of model building and feature selection is in the same order of magnitude; (b) if the number of features is considerably high (like more than 500), then the computational time of the mutual information matrix (required by most of the individual methods) becomes higher than the time taken by model training. Taking everything into account, the algorithm has a bigger time demand, between 171% and 7571%, with an average of 2246%, of the mean over all the other, individual algorithms.

Because of the already high computational times of the AHFS algorithm, the JMIM and NJMIM methods were discarded, as they are significantly slower compared to the other information theory based solutions; moreover, they do not add enough value in terms of potential feature candidates to justify their high computational demands.

The complexity can be linear in the number of iterations in a random search, but experiments show that in order to find the best feature subset, the number of iterations required is mostly at least quadratic in the number of features. The main reason is that most existing subset search methods demand the analysis of pairwise correlations between all features (named F-correlation). With quadratic or higher time complexity in terms of dimensionality, these algorithms do not have strong scalability to deal with extremely high dimensional data.

The runtime complexity of the proposed AHFS method highly depends on the complexity of the other FS algorithms used as submodules. However, the runtime requirements and also the complexity of the AHFS algorithm can be reduced in the future by switching the independent, interchangeable modules.

4.4. Comparison with other state-of-the-art feature selection techniques

This section focuses on the comparison of the proposed AHFS to the family of Relief-based feature selection methods [36]; they are very recent developments of the field and are state-of-the-art algorithms. Relief calculates a feature score for each feature. Using this score, the most important features of a dataset can be ranked and selected. The feature scoring is based on the identification of feature value differences between nearest neighbor (NN) instance pairs. In case of a feature value difference, the score is decreased (increased) if the corresponding class labels are the same (different). The original Relief can deal with discrete and continuous attributes, and it is limited to two-class problems.


Fig. 11. Comprehensive feature selection time performance considering different dataset sizes.

Fig. 12. Performance as model error on (a) the Calculated cutting dataset: Ra est(imation) and (b) Wind turbine monitoring: Bearing temperature est(imation). In case of (b), the RMSEs of the SURFstar and MultiSURFstar methods are much higher, therefore they are excluded from the figure (they are far above the current scale). It is also worth mentioning that the Exhaustive search is unfeasible with a higher number of features, as in case (b).

However, multiple extensions have been proposed to deal with noisy and incomplete data, and it has also been adapted to work with regression.

A freely available Relief-Based Algorithm Training Environment (ReBATE) was applied to fairly test and compare the performance of the proposed AHFS algorithm with these widely used, recent, state-of-the-art feature selection methods. ReBATE has been implemented with five core Relief-Based Algorithms (RBAs): ReliefF, SURF, SURFstar (SURF*), MultiSURF and MultiSURFstar (MultiSURF*).

These feature selection methods were explicitly developed for noisy regression tasks and were tested on numerous real-world problems; they are among the most recent and best performing state-of-the-art feature selection algorithms.
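As an illustration of such a comparison setup, the Relief-based algorithms are available in the ReBATE Python package (skrebate); the snippet below is a sketch of how such a scorer is typically invoked. The package, class names and constructor arguments are assumptions based on skrebate's scikit-learn-style interface, not details taken from the paper:

```python
import numpy as np
from skrebate import ReliefF                     # ReBATE implementation of ReliefF
from sklearn.datasets import make_classification

# a small synthetic classification problem just for demonstration
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

selector = ReliefF(n_features_to_select=10, n_neighbors=100)
selector.fit(X, y)

ranking = np.argsort(selector.feature_importances_)[::-1]   # best features first
X_reduced = selector.transform(X)                           # keep the top 10 features
```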

Furthermore, in order to compare the results to "the ideal" results, AHFS was compared with an exhaustive search in the case of a smaller dataset (Calculated cutting, Ra est.). All possible subsets are selected and evaluated during this kind of feature selection; the best result for each subset is selected and shown in Fig. 12 (a). Exhaustive search represents the best possible results, with an exceptionally high computational time requirement. With a high number of features the execution time explodes, so in such cases this kind of evaluation is unfeasible.
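For reference, exhaustive evaluation over all subsets of a given size can be sketched as follows; it reuses the hypothetical mlp_error helper from Section 3.3 and is only feasible for small feature counts:

```python
from itertools import combinations

def exhaustive_best_subset(X, y, subset_size):
    """Evaluate every feature subset of the given size and return the best one."""
    best_subset, best_error = None, float("inf")
    for subset in combinations(range(X.shape[1]), subset_size):
        error = mlp_error(X, y, list(subset))    # model-based evaluation
        if error < best_error:
            best_subset, best_error = list(subset), error
    return best_subset, best_error
```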


As presented in Fig. 12, the proposed AHFS method provides the smallest modeling error, outperforming the other state-of-the-art feature selection methods. The feature subsets produced by AHFS are more consistent, and therefore more reliable even for smaller subsets, than those of its competitors. It is worth mentioning that the performance of AHFS is the same as that of the exhaustive search in Fig. 12 (a); however, its required execution time is just a fraction of what the exhaustive search requires. In the given case of Calculated cutting, Ra est., the exhaustive search has a 9525% larger calculation time demand than the proposed AHFS algorithm. As a conclusion, it was measured that the proposed AHFS algorithm outperformed the recent state-of-the-art feature selection algorithms; moreover, it provides the same modeling accuracy as the ideal exhaustive search algorithm, while the computational demand of AHFS is smaller by several orders of magnitude. These results also prove the superiority of the proposed AHFS algorithm.

It shall be mentioned that, thanks to the flexible structure of the proposed AHFS algorithm, these other state-of-the-art feature selection algorithms can also be easily included in the AHFS solution.

5. Conclusions

A novel Adaptive, Hybrid Feature Selection (AHFS) approach has been presented, which chooses a combination of the most suitable feature selection methods for a given problem in order to achieve the best feature order (aiming to build up the most accurate model), motivated by the fact that there exists no general, "best of" or "best practice" feature selection solution in the Machine Learning field. Adaptivity of the proposed algorithm is realized in such a way that at an individual step of the feature selection algorithm it iterates not only in the space of the variables but in the space of the available feature selection techniques, too. This is the core idea presented in the paper. A double-level hybrid solution is proposed in the paper, because the introduced algorithm combines the given, available feature selection techniques and also utilizes the applied learning model in its mathematical algorithm.

Different feature selection methods were presented in detail with examples of their applications, and an exhaustive evaluation has been carried out to measure and compare their performance with the proposed approach.

Evaluation and comparison experiments were performed on artificial datasets with known effects, having (simple) linear dependencies with included independent variables, together with varying distributions, noise levels and outlier disturbances. AHFS was also compared on an artificial but highly non-linear dataset corrupted with Gaussian-distributed data, middle-level noise and outliers. Independently of the linearity, distribution, noise and outlier presence, AHFS consistently showed its superiority over the individual, state-of-the-art feature selection algorithms, proving its robustness against such challenging effects.

Tests on real-life benchmarking and industrial datasets proved that while the individual feature selection methods may perform badly on one or more of the test cases, the combined AHFS algorithm steadily provides a noticeably better solution. In comparison, the modelling accuracy improvement percentages vary from 139% to 304%, with an overall average of 183%, which indicates that the AHFS nearly doubles the accuracy (resulting in around half the value for the related modelling error) compared to the individual methods, making it a superior feature selection algorithm.

Since the AHFS must calculate all of the measures used by all of the incorporated feature selection algorithms, its computational requirement is always higher than the requirements of the other algorithms, both individually and together. Moreover, it incorporates model training steps which also have a more significant time demand in many cases. The required computational need varies between 171% and 7571% compared to the average need of the previous, individual solutions; all in all, its average time requirement ratio is 2246%.

At the final evaluation stage, AHFS was compared to five completely independent, recent, well performing Relief-based feature selection methods. It was measured that the proposed AHFS algorithm outperformed these recent state-of-the-art feature selection algorithms; moreover, it provides the same modeling accuracy as the ideal exhaustive search algorithm, while its computational demand is smaller by several orders of magnitude.

It has to be emphasized that the proposed algorithm was already successfully applied in the industrial collaborations of the authors before writing this paper. The models prepared according to the results of AHFS are already incorporated in the control system of the related machines and work well (detecting difficult-to-identify situations) on the shop floor, in daily production.

Future research is needed to reduce the computational time of the AHFS algorithm, which is currently a disadvantage compared to the individual filter methods. The drawbacks can be mitigated by using a different, non-greedy search strategy that is able to "think ahead" and, at the same time, reduce the number of required model trainings. Moreover, it would be very useful to have a general metric substituting the model-based evaluation; however, this problem seems to be extremely difficult. Finally, the selective involvement (calculation) of the incorporated individual feature selection algorithms at the individual search steps could increase the speed of AHFS significantly; this approach is currently being investigated by some of the authors, with some first successes already.

All in all, the proposed AHFS algorithm already proved to be superior to other state-of-the-art feature selection methods for the reasons that 1) it is significantly less sensitive to the varying properties of the dataset it is applied to, and 2) it provides a significantly better feature order for model building.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The research in this paper was supported by the European Commission through the H2020 project EPIC (https://www.centre-epic.eu/) under grant No. 739592, by the grant of the Highly Industrialised Region in Western Hungary with limited R&D capacity: "Strengthening of the regional research competencies related to future-oriented manufacturing technologies and products of strategic industries by a research and development program carried out in comprehensive collaboration" under grant No. VKSZ_12-1-2013-0038, by the Hungarian ED_18-2-2018-0006 grant on a "Research on prime exploitation of the potential provided by the industrial digitalisation", and by the Ministry of Innovation and Technology NRDI Office within the framework of the Artificial Intelligence National Laboratory Program.

References

[1] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research 3 (2003) 1157–1182.
[2] M.S. Srivastava, M.N. Joshi, M.M. Gaur, A review paper on feature selection methodologies and their applications, International Journal of Engineering Research and Development 7 (2013) 57–61.
[3] S. Muñoz-Romero, A. Gorostiaga, C. Soguero-Ruiz, I. Mora-Jiménez, J.L. Rojo-Álvarez, Informative variable identifier: expanding interpretability in feature selection, Pattern Recognit. 98 (2020).
[4] R. Shang, Y. Meng, W. Wang, F. Shang, L. Jiao, Local discriminative based sparse subspace learning for feature selection, Pattern Recognit. 92 (2019) 219–230.
[5] Y. Zhang, Q. Wang, D.-w. Gong, X.-f. Song, Nonnegative laplacian embedding guided subspace learning for unsupervised feature selection, Pattern Recognit. 93 (2019) 337–352.
[6] L. Zini, N. Noceti, G. Fusco, F. Odone, Structured multi-class feature selection with an application to face recognition, Pattern Recognit. Lett. 55 (2015) 35–41.
[7] Y. Jiang, C. Li, MRMR-based feature selection for classification of cotton foreign matter using hyperspectral imaging, Comput. Electron. Agric. 119 (2015) 191–200.
[8] Z. Zhang, H. Chen, Y. Xu, J. Zhong, N. Lv, S. Chen, Multisensor-based real-time quality monitoring by means of feature extraction, selection and modeling for Al alloy in arc welding, Mech. Syst. Signal Process. 60–61 (2015) 151–165.
[9] K. Zhang, Y. Li, P. Scarf, A. Ball, Feature selection for high-dimensional machinery fault diagnosis data using multiple models and radial basis function networks, Neurocomputing 74 (2011) 2941–2952.
[10] J.A. Carta, P. Cabrera, J.M. Matías, F. Castellano, Comparison of feature selection methods using ANNs in MCP-wind speed methods. A case study, Appl. Energy 158 (2015) 490–507.
[11] X. Kong, X. Liu, R. Shi, K.Y. Lee, Wind speed prediction using reduced support vector machines with feature selection, Neurocomputing 169 (2015) 449–456.
[12] J. Ircio, A. Lojo, U. Mori, J. Lozano, Mutual information based feature subset selection in multivariate time series classification, Pattern Recognit. 108 (2020) 107525, doi: 10.1016/j.patcog.2020.107525.
[13] S.-y. Jiang, L.-x. Wang, Efficient feature selection based on correlation measure between continuous and discrete features, Inf. Process. Lett. 116 (2016) 203–215.
[14] B. Senliol, G. Gulgezen, L. Yu, Z. Cataltepe, Fast correlation based filter (FCBF) with a different search strategy, 2008.
[15] Y. Jiang, C. Li, A fault diagnosis scheme for planetary gearboxes using modified multi-scale symbolic dynamic entropy and mRMR feature selection, Mech. Syst. Signal Process. 91 (2017) 295–312.

[16] S. Sharmin, M. Shoyaib, A.A. Ali, M.A.H. Khan, O. Chae, Simultaneous feature selection and discretization based on mutual information, Pattern Recognit. 91 (2019) 162–174.
[17] F. Amiri, M.R. Yousefi, C. Lucas, A. Shakery, N. Yazdani, Mutual information-based feature selection for intrusion detection systems, Journal of Network and Computer Applications 34 (2011) 1184–1199.
[18] R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Trans. Neural Networks 5 (4) (1994) 537–550.
[19] J. Song, Z. Zhu, P. Scully, C. Price, Modified mutual information-based feature selection for intrusion detection systems in decision tree learning, J. Comput. (Taipei) 9 (7) (2014) 1542–1546.
[20] L. Yu, H. Liu, Feature selection for high-dimensional data: a fast correlation-based filter solution, 2003.
[21] Y. Liu, J. Zhang, L. Ma, A fault diagnosis approach for diesel engines based on self-adaptive WVD, improved FCBF and PECOC-RVM, Neurocomputing 177 (2016) 600–611.
[22] Y. Jiang, C. Li, MRMR-based feature selection for classification of cotton foreign matter using hyperspectral imaging, Comput. Electron. Agric. 119 (2015) 191–200.
[23] H.H. Yang, J. Moody, Feature selection based on joint mutual information, 1999, pp. 22–25.
[24] H.H. Yang, J. Moody, Data visualization and feature selection: new algorithms for nongaussian data, 1999, pp. 687–693.
[25] P.A. Devijver, J. Kittler, Pattern Recognition: A Statistical Approach, Prentice-Hall International Inc., London, England, 1982.
[26] G. Wang, Q. Song, H. Sun, X. Zhang, B. Xu, Y. Zhou, A feature subset selection algorithm automatic recommendation method, J. Artif. Intell. Res. (JAIR) 47 (2013) 1–34.
[27] Z.J. Viharos, Automatic generation a net of models for high and low levels of production control, 16th IFAC World Congress, 2005.
[28] Z.J. Viharos, G. Erdös, A. Kovács, L. Monostori, AI supported maintenance and reliability system in wind energy production, 2010.
[29] Z.J. Viharos, K.B. Kis, Diagnostics of wind turbines based on incomplete sensor data, XX IMEKO World Congress - Metrology for Green Growth, TC10 on Technical Diagnostics, 2012.
[30] W.S. McCulloch, W.H. Pitts, A logical calculus of the ideas immanent in nervous activity, Bulletin of Mathematical Biophysics 5 (1943) 115–133.
[31] P.J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Harvard University, Cambridge, 1974.

[32] L. Deng, D. Yu, Deep learning methods and applications, Foundations and Trends in Signal Processing 7 (2014) 1–199.
[33] M. Lichman, UCI Machine Learning Repository, 2013, http://archive.ics.uci.edu/ml.
[34] D. Mo, Z. Lai, Robust jointly sparse regression with generalized orthogonal learning for image feature selection, Pattern Recognit. 93 (2019) 164–178.
[35] J. Zhao, J.-m. Liang, Z.-n. Dong, D.-y. Tang, L. Zhen, Accelerating information entropy-based feature selection using rough set theory with classified nested equivalence classes, Pattern Recognit. 107 (2020) 107517, doi: 10.1016/j.patcog.2020.107517.

[36] R.J. Urbanowicz, M. Meeker, W. La Cava, R.S. Olson, J.H. Moore, Relief-based feature selection: introduction and review, J. Biomed. Inform. (2018) 189–203.

Dr. Zsolt János Viharos, MBA, is a senior research fellow and project manager of the Institute for Computer Science and Control (SZTAKI) and a full researcher and lecturer at the John von Neumann University, in Hungary. In the Research Laboratory on Engineering and Management Intelligence at SZTAKI, he is the leader of the Intelligent Processes Research Group. He is leading various sizes of industrial, and national or European supported R&D projects with durations from some months up to many years. His typical roles are project sponsorship, project management and content leadership for industrial projects. He has 140 scientific publications resulting in 500 independent references, is a member of the Boards of Reviewers of the scientific journals Measurement, Applied Intelligence, and Reliability Engineering and System Safety, and is a member or chair of various scientific conferences. He is Chairperson of the TC10 - Measurement for Diagnostics, Optimization and Control of the Hungarian National IMEKO Committee. He is a member of the IEEE (Institute of Electrical and Electronics Engineers), No.: 93787359, and of the International Society of Applied Intelligence, a member of the Production Systems section of the Scientific Society for Mechanical Engineering (GTE) in Hungary, a member of the Computer and Automation Committee of the public body of the Hungarian Academy of Sciences (MTA), and a member of the Hungarian Standards Institution (MSZT). (Please visit: http://www.sztaki.hu/~viharos)
