Unsupervised Feature Selection for Multi-Cluster Data

《Unsupervised Feature Selection for Multi-Cluster Data》论文全篇

UnsupervisedFeatureSelectionforMulti-ClusterData

DengCai

ChiyuanZhang

XiaofeiHe

{dengcai,xiaofeihe}@http://www.mianfeiwendang.com,pluskid@http://www.mianfeiwendang.com

ABSTRACT

Inmanydataanalysistasks,oneisoftenconfrontedwithveryhighdimensionaldata.Featureselectiontechniquesaredesignedto ndtherelevantfeaturesubsetoftheoriginalfeatureswhichcanfacilitateclustering,classi cationandre-trieval.Inthispaper,weconsiderthefeatureselectionprob-leminunsupervisedlearningscenario,whichisparticularlydi cultduetotheabsenceofclasslabelsthatwouldguidethesearchforrelevantinformation.Thefeatureselectionproblemisessentiallyacombinatorialoptimizationproblemwhichiscomputationallyexpensive.Traditionalunsuper-visedfeatureselectionmethodsaddressthisissuebyselect-ingthetoprankedfeaturesbasedoncertainscorescom-putedindependentlyforeachfeature.Theseapproachesne-glectthepossiblecorrelationbetweendi erentfeaturesandthuscannotproduceanoptimalfeaturesubset.InspiredfromtherecentdevelopmentsonmanifoldlearningandL1-regularizedmodelsforsubsetselection,weproposeinthispaperanewapproach,calledMulti-ClusterFeatureSelection(MCFS),forunsupervisedfeatureselection.Speci cally,weselectthosefeaturessuchthatthemulti-clusterstructureofthedatacanbebestpreserved.Thecorrespondingop-timizationproblemcanbee cientlysolvedsinceitonlyinvolvesasparseeigen-problemandaL1-regularizedleastsquaresproblem.Extensiveexperimentalresultsovervari-ousreal-lifedatasetshavedemonstratedthesuperiorityoftheproposedalgorithm.

StateKeyLabofCAD&CG,CollegeofComputerScience

ZhejiangUniversity,China

1.INTRODUCTION

Inmanyapplicationsincomputervision,patternrecog-nitionanddatamining,oneisoftenconfrontedwithveryhighdimensionaldata.Highdimensionalitysigni cantlyin-creasesthetimeandspacerequirementsforprocessingthedata.Moreover,variousdataminingandmachinelearningtasks,suchasclassi cationandclustering,thatareana-lyticallyorcomputationallymanageableinlowdimensionalspacesmaybecomecompletelyintractableinspacesofsev-eralhundredorthousanddimensions[12].Toovercomethisproblem,featureselectiontechniques[3,4,17,21,29,30]aredesignedtoreducethedimensionalityby ndingarelevantfeaturesubset.Onceasmallnumberofrelevantfeaturesareselected,conventionaldataanalysistechniquescanthenbeapplied.

Basedonwhetherthelabelinformationisavailable,fea-tureselectionmethodscanbeclassi edintosupervisedandunsupervisedmethods.Supervisedfeatureselectionmeth-odsusuallyevaluatetheimportanceoffeaturesbythecor-relationbetweenfeaturesandclasslabel.Thetypicalsuper-visedfeatureselectionmethodsincludePearsoncorrelationcoe cients[23],Fisherscore[12],andInformationgain[11].However,inpractice,thereisusuallynoshortageofunla-beleddatabutlabelsareexpensive.Hence,itisofgreatsigni cancetodevelopunsupervisedfeatureselectionalgo-rithmswhichcanmakeuseofallthedatapoints.Inthispaper,weconsidertheproblemofselectingfeaturesinunsu-pervisedlearningscenarios,whichisamuchharderproblemduetotheabsenceofclasslabelsthatwouldguidethesearchforrelevantinformation.

Thefeatureselectionaimsatselectingthemostrelevantfeaturesubsetbasedoncertainevaluationcriteria.Thisproblemisessentiallyacombinatorialoptimizationprob-lemwhichiscomputationallyexpensive.Traditionalfea-tureselectionmethodsaddressthisissuebyselectingthetoprankedfeaturesbasedonsomescorescomputedinde-pendentlyforeachfeature.Thescoresareusuallyde nedtore ectthepowerofeachfeatureindi erentiatingdi er-entclasses/clusters.Thisapproachmayworkwellonbinaryclasses/clustersproblems.However,itisverylikelytofailinmulticlasses/clusterscases.Fig.(1)showsanintuitiveexample.TherearethreeGaussiansinathreedimensionalspace.Withoutthelabelinformation,somepopularunsu-pervisedfeatureselectionmethods(e.g.,MaximumvarianceandLaplacianScore[17])rankthefeaturesasa>b>c.Ifoneisaskingtoselecttwofeatures,thesemethodswillselectfeaturesaandb,whichisobviouslysub-optimal.Whendeal-ingwithmulticlasses/clustersdata,di erentfeatureshave

CategoriesandSubjectDescriptors

I.5.2[PatternRecognition]:DesignMethodology—Fea-tureevaluationandselection

GeneralTerms

Algorithms,Theory

Keywords

Featureselection,Unsupervised,Clustering

Permissiontomakedigitalorhardcopiesofallorpartofthisworkforpersonalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributedforpro torcommercialadvantageandthatcopiesbearthisnoticeandthefullcitationonthe rstpage.Tocopyotherwise,torepublish,topostonserversortoredistributetolists,requirespriorspeci cpermissionand/orafee.

KDD’10,July25–28,2010,Washington,DC,USA.

Copyright2010ACM978-1-4503-0055-1/10/07...$10.00.

Word文档免费下载Word文档免费下载:Unsupervised Feature Selection for Multi-Cluster Data (共10页,当前第1页)

Unsupervised Feature Selection for Multi Cluster Data相关文档

最新文档

返回顶部