Improving Document Retrieval by Automatic Query Expansion Using Collaborative Learning of Term-Based Concepts

Stefan Klink, Armin Hust, Markus Junker, and Andreas Dengel

German Research Center for Artificial Intelligence (DFKI GmbH)
P.O. Box 2080, 67608 Kaiserslautern, Germany
{Stefan.Klink, Armin.Hust, Markus.Junker, Andreas.Dengel}@dfki.de
http://www.dfki.de/~klink

Abstract. Query expansion methods have been studied for a long time – with debatable success in many instances. In this paper, a new approach is presented based on using term concepts learned by other queries. Two important issues with query expansion are addressed: the selection and the weighting of additional search terms. In contrast to other methods, the regarded query is expanded by adding those terms which are most similar to the concept of individual query terms, rather than selecting terms that are similar to the complete query or that are directly similar to the query terms. Experiments have shown that this kind of query expansion results in notable improvements of the retrieval effectiveness if measured in recall/precision in comparison to the standard vector space model and to the pseudo relevance feedback. This approach can be used to improve the retrieval of documents in Digital Libraries, in Document Management Systems, in the WWW etc.

1 Introduction

As the Internet and Digital Libraries become more and more popular, the growing number of documents has raised the problem called information overload. Typical search engines index billions of pages across a variety of categories, and return results ranked by expected topical relevance. But only a small percentage of these pages may be of specific interest.

In Information Retrieval (IR) the number of retrieved documents is related to the number of appropriate search terms. Retrieval with short queries is typical in Web search [13], but it is much harder as compared to retrieval with long queries. This is because shorter queries often provide less information for retrieval. Modern IR systems therefore integrate thesaurus browsers to find additional search terms [24]. However, the aim of the retrieval activity is not to retrieve a large number of documents. Rather, users are interested in a high usefulness of the retrieved documents.

Another problem which is typical for the Web and for Digital Libraries is that the terminology used in defining queries is often different to the terminology used in the

representing documents. Even if some users have the same information need they rarely use the same terminology in their queries. Many intelligent retrieval approaches [5, 18, 23] have tried to bridge this terminological gap.

Research on automatic query expansion (or modification) was already under way before the 60's when initial requests were enlarged on the grounds of statistical evidence [30]. The idea was to obtain additional relevant documents through expanded queries based on the co-occurrence of the terms. However, this kind of automatic query expansion has not been very successful. The retrieval effectiveness of the expanded queries was often not greater than, or even less than, the effectiveness of the original queries [21, 22, 31].

One idea involves the use of a relevance feedback environment where the system retrieves documents that may be relevant to a user's query. The user judges the relevance of one or more of the retrieved documents and these judgments are fed back to the system to improve the initial search result. This cycle of relevance feedback can be iterated until the user is satisfied with the retrieved documents. In this case, we can say that the more feedback is given to the system the better is the search effectiveness of the system. This behavior is verified by [4]. He has shown that the recall/precision effectiveness is proportional to the log of the number of relevant feedback documents.

But in a traditional relevance feedback environment the user-voted documents are appropriate to the complete query. That means that the complete query is adapted to the user's needs. If another user has the same intention but uses a different terminology or just one word more or less in his query then the traditional feedback environment doesn't recognize any similarities in these situations.
Another idea to solve the terminology problem is to use query concepts. The system called 'Rule Based Information Retrieval by Computer' (RUBIC) [1, 5, 18] uses production rules to capture user query concepts. In RUBIC, a set of related production rules is represented as an AND/OR tree, called a rule base tree. RUBIC allows the definition of detailed queries starting at a conceptual level. The retrieval output is determined by fuzzy evaluation of the AND/OR tree. To find proper weight values, Kim and Raghavan developed a neural network (NN) model in which the weights for the rules can be adjusted by users' relevance feedback. Their approach is different from the previous NN approaches for IR in two aspects [12, 14]. First, they handle relations between concepts and Boolean expressions in which weighted terms are involved. Second, they do not use their own network model but an already proven model in terms of its performance.

But the crucial problem of a rule-based system still exists: the automatic production of proper rules and the learning of appropriate structures of rules, not just the weights.

2 Query Expansion

The crucial point in query expansion is the question: Which terms (or phrases) should be included in the query formulation? If the query formulation is to be expanded by additional terms there are two problems that are to be solved, namely how are these terms selected and how are the parameters estimated for these terms.

Many terms used in human communication are ambiguous or have several meanings [20]. But in most cases these ambiguities are resolved automatically without noticing the ambiguity. The way this is done by humans is still an open problem of psychological research, but it is almost certain that the context in which a term occurs plays a central role.

Most attempts at automatically expanding queries failed to improve the retrieval effectiveness and it was often concluded that automatic query expansion based on statistical data was unable to improve the retrieval effectiveness substantially [22].

But this could have several reasons. Term-based query expansion approaches are mostly using hand-made thesauri or just plain co-occurrence data. They often do not use learning technologies for the query terms. On the other hand, those who use learning technologies (Neural Networks, Support Vector Machines, etc.) are query-based. That means these systems learn concepts (or additional terms) for the complete query.

In contrast to learning complete queries, the vital advantage of using term-based concepts is that other users can profit from learned concepts even if the same query was never used before. A statistical evaluation of log files has shown that the probability that a searcher uses exactly the same query as a previous searcher is much lower than the probability that parts of the query (phrases or terms) occur in previous queries. So, even if a searcher never used the given search term, the probability that other searchers had used it is very high and then he can profit from the learned concept.

3 Traditional Document Retrieval

The task of traditional document retrieval is to retrieve documents which are relevant to a given query from a fixed set of documents, i.e. a document database. In a common way to deal with documents as well as queries, they are represented using a set of index terms (simply called terms) by ignoring their positions in documents and queries. Terms are determined based on words of documents in the database, usually during pre-processing phases where some normalization procedures are incorporated (e.g. stemming and stop-word elimination).

In the following, t_i (1 ≤ i ≤ M) and d_j (1 ≤ j ≤ N) represent a term and a document in the database, respectively, where M is the number of terms and N is the number of documents.

3.1 Vector Space Model

The most popular and the simplest retrieval model is the vector space model (VSM) [5]. In the VSM, a document d_j is represented as an M-dimensional vector

    d_j = (w_{1j}, ..., w_{Mj})^T    (1)

where T indicates the transpose and w_ij is the weight of a term t_i in a document d_j. A query is likewise represented as

    q_k = (w_{1k}, ..., w_{Mk})^T,   1 ≤ k ≤ L    (2)

where w_ik is the weight of a term t_i in a query q_k. These weights are computed by the standard normalized tf · idf weighting scheme [27] as follows:

    w_ij = tf_ij · idf_i    (3)

where tf_ij is the weight calculated using the term frequency f_ij and idf_i is the weight calculated using the inverse of the document frequency.

The result of the retrieval is represented as a list of documents ranked according to their similarity to the query. The similarity sim(d_j, q_k) between a document d_j and a query q_k is measured by the standard cosine of the angle between d_j and q_k:

    sim(d_j, q_k) = (d_j^T q_k) / (||d_j|| ||q_k||)    (4)

where ·ithe s Euclidean norm of vaector. Pseudo 3.2 RelevanceFeedback Awell-knownmethodtoobtainthetermsforaquery vancefeedback[18].Here,inafirststep,documen query,likeintheVSM.Then,thehighly rankeddoc andtheirtermsareincorporatedintooriginalquer arerankedagain buysing thenew expandedquery. Inthispaper,weemploy(likeinKiseeal. t [15]) relevancefeedback: Let ¸be set a of documentvectors for expansion given b ¸=

ì + ídj î

expansionisthepseudoreletsarerankedwithanoriginal uments areassumedtobreelevant y.Inthesecondstep,thedocuments asimplevariantofthepseudo

ï sim(d j , q) di , q) ï maxsim( i +

ü þ

y

≥θ ý

is τthreshold a othe f similarity.Thesum

where qiasnoriginalquery vectorand thedocumentvectors in

å d+ j

d=s

q’=

q + q

d α d s s

where αisaparameterforcontrollingtheweightofthen ponent.Finally,thedocumentsarerankedagainacc totheexpandedquery.

dof s

(6)

d j ∈¸ +

canbeconsideredaesnrichedinformationaboutthe pandedquery vector q’is obtainedby

(5)

originalquery.Then,theex-

(7) ewlyincorporatedcomordingtothesimilarity sim(dq’ ) j,
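The ranking and expansion steps of Sections 3.1 and 3.2 can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the toy sparse-vector representation (dicts mapping term → weight) are our own, not code from the paper, and the tf · idf variant here is a plain tf × log(N/df) without the length normalization the paper's weighting scheme applies.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf vectors (cf. eq. 3) for a list of token lists."""
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(N / df[t]) for t in df}
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs, idf

def cosine(u, v):
    """Cosine similarity (eq. 4) between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def prf_expand(q, doc_vecs, theta=0.5, alpha=1.0):
    """Pseudo relevance feedback (eqs. 5-7): add the normalized sum
    of the top-ranked document vectors to the query."""
    sims = [cosine(d, q) for d in doc_vecs]
    smax = max(sims) or 1.0
    # eq. (5): documents whose relative similarity reaches theta
    D_plus = [d for d, s in zip(doc_vecs, sims) if s / smax >= theta]
    # eq. (6): sum of the selected document vectors
    ds = Counter()
    for d in D_plus:
        ds.update(d)
    norm = math.sqrt(sum(w * w for w in ds.values())) or 1.0
    # eq. (7): q' = q + alpha * d_s / ||d_s||
    q_new = dict(q)
    for t, w in ds.items():
        q_new[t] = q_new.get(t, 0.0) + alpha * w / norm
    return q_new
```

The expanded query is then simply ranked against the document vectors with `cosine` again, which corresponds to the second retrieval step described above.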

4 Learning Term-based Concepts

A problem of the standard VSM is that a query is often too short to rank documents appropriately. To cope with this problem, our approach is to enrich the original query by expanding it with terms occurring in the documents of the collection. But in contrast to traditional pseudo relevance feedback methods, where the top ranked documents are assumed to be relevant and then all their terms are incorporated into the expanded query, a different technique is used to compute the relevant documents as follows:

Let q = t_1 ... t_n be the user query containing the terms t_1, ..., t_n and q = (w_1, ..., w_i, ..., w_M)^T be the vector representation of this query. Let Q = {q_1, ..., q_m} be the set of all previous queries q_1, ..., q_m, and D_k^+ be the set of relevant documents of the query q_k.

The goal is now to learn for each term t_i a concept c_i (1 ≤ i ≤ n) with the help of previous queries and their appropriate relevant documents. For this, the term t_i is searched in all previous queries and if it is found, the relevant documents of these queries are used to learn the concept.

Due to the VSM, a concept is also a weighted vector of terms and calculated with:

    c_i = τ_i (0, ..., w_i, ..., 0)^T + δ_i Σ_{k: t_i ∈ q_k} Σ_{d ∈ D_k^+} d    (8)

where 0 ≤ τ_i, δ_i ≤ 1 are weights for the original term and the additional terms, respectively.

The expanded query vector is obtained by the sum of all term-based concepts:

    q' = Σ_{i=1}^{n} c_i    (9)

Before applying the expanded query, it is normalized by

    q'' = q' / ||q'||    (10)

For this approach, the complete documents (e.g. all term weights of the document vector) are summed up and added to the query. Although in some papers it is reported that using just the top ranked terms is sufficient or sometimes better, experiments with this approach on our collections have shown that the more words are used to learn the concepts the better are the results. So, the decision was made to always use all terms of the documents and not only some (top ranked) terms.

If no ground truth of relevant documents is available, relevance feedback techniques can be used to get the ground truth. Then, concepts are learned by adding terms from retrieved relevant documents.
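The concept learning of eqs. (8)–(10) can be sketched as follows. Again this is a minimal illustration under our own assumptions: queries are represented as sets of terms, documents as sparse term → weight dicts, and the function names are ours, not the paper's.

```python
import math
from collections import Counter

def learn_concept(term, term_weight, past_queries, relevant_docs,
                  tau=1.0, delta=1.0):
    """Term-based concept (eq. 8): the weighted original term plus the
    summed vectors of all documents relevant to past queries that
    contain the term."""
    concept = Counter({term: tau * term_weight})
    for k, q_terms in enumerate(past_queries):
        if term in q_terms:
            for d in relevant_docs[k]:          # D_k^+ for query q_k
                for t, w in d.items():
                    concept[t] += delta * w
    return concept

def expand_query(query_vec, past_queries, relevant_docs):
    """Expanded query (eqs. 9-10): sum the concepts of all query terms,
    then normalize to unit length."""
    q_new = Counter()
    for term, w in query_vec.items():
        q_new.update(learn_concept(term, w, past_queries, relevant_docs))
    norm = math.sqrt(sum(v * v for v in q_new.values())) or 1.0
    return {t: v / norm for t, v in q_new.items()}
```

As in the paper's experiments, the default τ_i = δ_i = 1 simply adds the full relevant document vectors to the original term weight.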

5 Experiments and Results

Section 5.1 describes the test collections, section 5.2 describes our evaluation methods, and section 5.3 presents the results.

5.1 Test Collections

For our comparison we used four standard test collections: CACM (collection of titles and abstracts from the journal 'Communications of the ACM'), CR (congressional report), FR (federal register) and NPL (also known as the VASWANI). These collections are contained in the TREC disks [9]. All collections are provided with queries and their ground truth (a list of documents relevant to each query). For these collections, terms used for document representation were obtained by stemming and eliminating stop words.

Table 1 lists statistics about the collections after stemming and stop-word elimination. In addition to the number of documents, a difference among the collections is the document length: CACM and NPL consist of abstracts, while CR and FR contain much longer documents.

Queries in the TREC collections are mostly provided in a structured format with several fields. In this paper, the "title" (the shortest representation) is used for the CR and NPL collections whereas the "desc" (description; medium length) is used for the CACM and FR collections.

Table 1. Statistics about collections used for experiments

                             CACM     CR       FR       NPL
#documents                   3204     27922    19789    11429
#queries                     52       34       112      93
#different terms             3029     45717    50866    4415
avg doc length [terms]       25.8     672.8    863.7    21.8
avg query length [terms]     10.9     3.1      9.8      6.6

5.2 Evaluation

The following paragraphs describe some basic evaluation methods used in this paper. For further information and a more detailed description see Kise et al. [15].

5.2.1 Average Precision

A common way to evaluate the performance of retrieval methods is to compute the (interpolated) precision at some recall levels. This results in a number of recall/precision points which are displayed in recall/precision graphs [5]. However, it is sometimes convenient for us to have a single value that summarizes the performance. The average precision (non-interpolated) over all relevant documents [5, 7] is a measure resulting in a single value. The definition is as follows:

As described in section 3, the result of retrieval is represented as the ranked list of documents. Let r(i) be the rank of the i-th relevant document counted from the top of the list. The precision for this document is calculated by i/r(i). The precision values for all documents relevant to a query are averaged to obtain a single value for the query. The average precision over all relevant documents is then obtained by averaging the respective values over all queries.

5.2.2 Statistical Test

The next step for the evaluation is to compare the values of the average precision obtained by different methods [15]. An important question here is whether the difference in the average precision is really meaningful or just by chance. In order to make such a distinction, it is necessary to apply a statistical test.

Several statistical tests have been applied to the task of information retrieval [11, 33]. In this paper, we utilize the test called "macro t-test" [33] (called paired t-test in [11]). The following is a summary of the test described in [15]:

Let a_i and b_i be scores (e.g., the average precision) of retrieval methods A and B for a query q_i and define d_i = a_i − b_i. The test can be applied under the assumptions that the model is additive, i.e., d_i = µ + ε_i where µ is the population mean and ε_i is an error, and that the errors are normally distributed. The null hypothesis here is µ = 0 (A performs equivalently to B in terms of the average precision), and the alternative hypothesis is µ > 0 (A performs better than B).

It is known that the Student's t-statistic

    t = d̄ / sqrt(s² / n)    (11)

follows the t-distribution with the degree of freedom of n − 1, where n is the number of samples (queries), and d̄ and s² are the sample mean and the variance:

    d̄ = (1/n) Σ_{i=1}^{n} d_i,    s² = (1/(n−1)) Σ_{i=1}^{n} (d_i − d̄)²    (12)

By looking up the value of t in the t-distribution, we can obtain the P-value, i.e., the probability of observing the sample results d_i (1 ≤ i ≤ n) under the assumption that the null hypothesis is true. The P-value is compared to a predetermined significance level σ in order to decide whether the null hypothesis should be rejected or not. As significance levels, we utilize 0.05 and 0.01.
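Both evaluation measures are straightforward to implement. The sketch below follows the definitions above; the function names and data shapes (a ranked list of document ids, a set of relevant ids, and parallel lists of per-query scores) are our own illustrative choices.

```python
import math

def average_precision(ranking, relevant):
    """Non-interpolated average precision: the precision for the i-th
    relevant document at rank r(i) is i/r(i); these values are averaged
    over all relevant documents of the query."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def macro_t_test(scores_a, scores_b):
    """Macro (paired) t-test of eqs. (11)-(12): t-statistic over the
    per-query score differences d_i = a_i - b_i."""
    d = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(d)
    d_bar = sum(d) / n                                # sample mean, eq. (12)
    s2 = sum((x - d_bar) ** 2 for x in d) / (n - 1)   # variance, eq. (12)
    return d_bar / math.sqrt(s2 / n)                  # eq. (11)
```

The resulting t is then compared against the t-distribution with n − 1 degrees of freedom to obtain the P-value, e.g. via a statistics library such as SciPy.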

5.3 Results and Comparison to the Standard

In this section the results of the experiments are presented. Results were evaluated using the average precision over all queries. Recall/precision graphs were generated and then significance tests were applied to the results.

5.3.1 Recall and Precision

As described above, term weights in both documents and queries are determined according to the normalized tf · idf weighting scheme and the similarity is calculated by the VSM cosine measure, see also formula (3).

The results of the pseudo relevance feedback depend on the two parameters β (weight) and θ (similarity threshold). To get the best results, we varied β from 0 to 5.0 with step 0.1 and θ from 0.0 to 1.0 with step 0.05. For each collection, the best individual β and θ are calculated such that their average precision is highest:

Table 2. Best values for pseudo relevance feedback parameters

                       CACM    CR      FR      NPL
β (weight)             1.70    1.50    0.30    2.00
θ (sim. threshold)     0.35    0.75    0.00    0.45

The results for the concept-based queries are calculated as follows: For each individual query, all concepts of the terms within this query are learned by using all other queries with the help of their relevant documents (leave-one-out test) and the expanded query is used to calculate the new recall/precision result. Of course, the relevant documents of the current query are not used to learn the concepts.

The results of our concept-based expansion also depend on weights. For the experiments described in this paper we just used the default value: τ_i = δ_i = 1.

Figure 1 shows the recall/precision results of the original query with the standard vector space model (red line, VSM), the pseudo relevance feedback (blue line, PRF) and the expanded query using learned concepts (green line, concepts).

The recall/precision graphs in figure 1 indicate that the automatic query expansion method based on learned concepts yields a considerable improvement in the retrieval effectiveness in 3 collections over all recall points compared to the standard vector space model and to the pseudo relevance feedback method (except with the NPL collection). There is no indication that the improvement depends on the size of the collection, the number of documents, nor on the number or size of the queries. The method performs well on CACM but only somewhat better than the VSM on the NPL. On FR it performs better than on the CR collection. Looking at the figures the impression could arise that our method performs better with longer queries. But experiments with the CR collection have shown that 'title' queries result in a better precision than 'description' or 'narrative' queries. This behavior is in contrast to the first impression of the figures.

Additionally, as described above, the more words and the more documents are used to learn the concept the better are the results. Experiments have shown that the precision continues to increase as more documents are used.

Fig. 1. Recall/precision of CACM, CR, FR, and NPL
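The leave-one-out protocol used for the concept-based runs can be sketched generically. Here `learn` and `evaluate` are placeholders of our own for the concept learning of Section 4 and the average precision measure of Section 5.2; only the protocol itself is from the paper.

```python
def leave_one_out(queries, relevant_sets, learn, evaluate):
    """Leave-one-out protocol: the expansion for query k is learned only
    from the *other* queries and their relevant documents, never from
    the ground truth of query k itself."""
    scores = []
    for k, q in enumerate(queries):
        others = [(qq, rel)
                  for j, (qq, rel) in enumerate(zip(queries, relevant_sets))
                  if j != k]
        expanded = learn(q, others)          # concepts from other queries only
        scores.append(evaluate(expanded, relevant_sets[k]))
    return sum(scores) / len(scores)         # mean over all queries
```

This guarantees that the reported improvements cannot stem from the current query's own relevance judgments leaking into its expansion.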

5.3.2 Statistical Tests

The results above are very exciting. But in order to be sure that these results are really meaningful and not just by chance, it is necessary to apply a statistical test. As described above, we used the "macro t-test". The results of the macro t-test for all pairs of methods are shown in table 3. The meaning of the symbols such as "≪", ">" and "~" is summarized at the bottom of the table:

    "≪", "≫"    P-value ≤ 0.01
    "<", ">"     0.01 < P-value ≤ 0.05
    "~"          0.05 < P-value

6 Conclusions and Future Work

We have described an approach for bridging the terminology gap between user queries and potential answer documents using term-based concepts. In this approach, for all query terms, concepts are learned from previous queries and relevant answer documents. A new query is transformed by replacing the original query terms by the learned concepts. This is in contrast to traditional query expansion methods which do not take into account previous queries but mostly rely on term statistics of the underlying document collection or hand-made thesauri.

Our experiments made on four standard test collections with different sizes and different document types have shown considerable improvements vs. the original queries in the standard vector space model and vs. the pseudo relevance feedback (here, except on the NPL collection). This is true even though the parameters for pseudo relevance feedback were optimized, while for our approach we did not touch the parameters. The improvements seem not to depend on the type nor on the size of the collection and they are not obtained by chance.

The vital advantage of the approach is that in a multi-user scenario users can benefit from concepts learned by other users. The approach is expected to perform better as more users and queries are involved. It can be used to improve the retrieval of documents in Digital Libraries, Document Management Systems, WWW etc.

In the near future it is planned to make some experiments on the influence of the weights τ_i and δ_i (cmp. equation (8)), and to develop functions for calculating these parameters for each individual concept. Further experiments are planned using user-voted relevance feedback instead of collection-given ground-truth to test the performance on 'real-life' data. For this we are currently collecting queries and click data from a search engine [10].
The approach on passage-based retrieval by Kise [15] has shown good improvements vs. LSI and Density Distribution. Instead of using the complete relevant documents for expanding the query or using the n top ranked terms, an interesting idea for the future is to use just terms of relevant passages within the documents. This should increase the quality of the expanded queries.

7 Acknowledgements

This work was supported by the German Ministry for Education and Research, bmb+f (Grant: 01IN902B8).

References

1. Aalbersberg I.J.: Incremental relevance feedback. In Proceedings of the Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 11-22, 1992
2. Allan J.: Incremental relevance feedback for information filtering. In Proceedings of the 19th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 270-278, 1996
3. Alsaffar A.H., Deogun J.S., Raghavan V.V., Sever H.: Concept-based retrieval with minimal term sets. In Z.W. Ras and A. Skowron, editors, Foundations of Intelligent Systems: 11th Int. Symposium, ISMIS'99, pp. 114-122, Springer, Warsaw, Poland, June 1999
4. Buckley C., Salton G., Allen J.: The effect of adding relevance information in a relevance feedback environment. In Proceedings of the 17th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 292-300, 1994
5. Baeza-Yates R., Ribeiro-Neto B.: Modern Information Retrieval. Addison-Wesley Pub. Co., 1999. ISBN 020139829X
6. Croft W.B.: Approaches to intelligent information retrieval. Information Processing and Management, 1987, Vol. 23, No. 4, pp. 249-254
7. ftp://ftp.cs.cornell.edu/pub/smart/
8. Harman D.: Towards Interactive Query Expansion. In: Chiaramella Y. (editor): 11th International Conference on Research and Development in Information Retrieval, pp. 321-331, Grenoble, France, 1988
9. http://trec.nist.gov/
10. http://phibot.org/
11. Hull D.: Using Statistical Testing in the Evaluation of Retrieval Experiments. In Proceedings of the 16th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 329-338, 1993
12. Iwayama M.: Relevance Feedback with a Small Number of Relevance Judgments: Incremental Relevance Feedback vs. Document Clustering. In Proceedings of the 23rd Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 10-16, Athens, Greece, July 2000
13. Jansen B.J., Spink A., Bateman J. and Saracevic T.: Real Life Information Retrieval: A Study of User Queries on the Web. In SIGIR Forum, Vol. 31, pp. 5-17, 1988
14. Kim M., Raghavan V.: Adaptive concept-based Retrieval Using a Neural Network. In Proceedings of ACM SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval, Athens, Greece, July 2000
15. Kise K., Junker M., Dengel A., Matsumoto K.: Passage-Based Document Retrieval as a Tool for Text Mining with User's Information Needs. In Proceedings of the 4th International Conference on Discovery Science, pp. 155-169, Washington, DC, USA, November 2001
16. Kwok K.: Query Modification and Expansion in a Network with Adaptive Architecture. In Proceedings of the 14th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 192-201, 1991
17. Lu F., Johnsten Th., Raghavan V.V., Traylor D.: Enhancing Internet Search Engines to Achieve Concept-based Retrieval. In Proceedings of Inforum'99, Oakridge, USA
18. Manning C.D. and Schütze H.: Foundations of Statistical Natural Language Processing. MIT Press, 1999
19. Maglano V., Beaulieu M., Robertson S.: Evaluation of interfaces for IRS: modeling end-user search behaviour. 20th Colloquium on Information Retrieval, Grenoble, 1988
20. McCune B.P., Tong R.M., Dean J.S., Shapiro D.G.: RUBRIC: A System for Rule-Based Information Retrieval. In IEEE Transactions on Software Engineering, Vol. SE-11, No. 9, September 1985
21. Minker J., Wilson G.A., Zimmerman B.H.: An evaluation of query expansion by the addition of clustered terms for a document retrieval system. Information Storage and Retrieval, vol. 8(6), pp. 329-348, 1972
22. Peat H.J., Willet P.: The limitations of term co-occurrence data for query expansion in document retrieval systems. Journal of the ASIS, vol. 42(5), pp. 378-383, 1991
23. Pirkola A.: Studies on Linguistic Problems and Methods in Text Retrieval: The Effects of Anaphor and Ellipsis Resolution in Proximity Searching, and Translation and Query Structuring Methods in Cross-Language Retrieval. PhD dissertation, Department of Information Studies, University of Tampere. Acta Universitatis Tamperensis 672. ISBN 951-44-4582-1; ISSN 1455-1616. June 1999
24. Qiu Y.: ISIR: an integrated system for information retrieval. In Proceedings of 14th IR Colloquium, British Computer Society, Lancaster, 1992
25. van Rijsbergen C.J., Harper D.H., et al.: The Selection of Good Search Terms. Information Processing and Management 17, pp. 77-91, 1981
26. Resnik P.: Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th Int. Joint Conference on Artificial Intelligence, pp. 448-453, 1995
27. Salton G., Buckley C.: Term weighting approaches in automatic text retrieval. Information Processing & Management 24(5), pp. 513-523, 1988
28. Salton G., Buckley C.: Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science 41(4), pp. 288-297, 1990
29. Sanderson M., Croft B.: Deriving concept hierarchies from text. In Proceedings of the 22nd Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 206-213, Berkeley, CA, August 1999
30. Sparck-Jones K.: Notes and references on early classification work. In SIGIR Forum, vol. 25(1), pp. 10-17, 1991
31. Smeaton A.F., van Rijsbergen C.J.: The retrieval effects of query expansion on a feedback document retrieval system. The Computer Journal, vol. 26(3), pp. 239-246, 1983
32. Stucky D.: Unterstützung der Anfrageformulierung bei Internet-Suchmaschinen durch User Relevance Feedback. Diploma thesis, German Research Center of Artificial Intelligence (DFKI), Kaiserslautern, November 2000
33. Yang Y. and Liu X.: A Re-Examination of Text Categorization Methods. In Proceedings of the 22nd Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42-49, Berkeley, CA, August 1999