ࡱ> &(%S bjbj "8xx%l....^^^8Dt#>(*"LLL"""""""$$ &#@#c!..LL .#c!c!c!.<LL"c!"c!c!!j|!L py#d ^!!D#0t#!R'R'!c!....CS 276a Problem Set #2 Assigned: Tuesday, November 9, 2004 Due: Thursday, November 18, 2004 by 5:30 p.m. Review session: Friday, November 12, 2004, 3:15-4:05 p.m. in Gates B01 Delivery: Local students should hand their solutions to Dan or Louis in class or leave them under Professor Mannings door (158). SCPD students should use the SCPD courier system. Late policy: Refer to the course webpage. Honor code: Please review the collaboration and honor code policy on the course webpage. Also note: because some questions may be drawn from previous years problem sets or exams, students are forbidden to consult solution sets from previous years unless we explicitly provide them. 7 questions and 90 points total. #1. Probabilistic models (15 points) We are given a corpus consisting of the following two documents. We calculate probabilities based on the corpus as a whole. "the martian has landed" "the latin pop sensation ricky martin" a. Under a unigram probability model, what are P("the") and P("martian")? b. Under a bigram model what are P("sensation"|"pop") and P("pop"|"the")? c. What are P("ricky martian") and P("ricky martin") under a unigram probability model? bigram? d. Consider P("pop martian") and P("pop martin"). Which should be higher? Does a unigram model agree with this judgment? What about a bigram model? If neither one agrees, suggest another probability model that would. e. What is unigram P("the red martian has landed")? If this is unreasonable, suggest a way to fix it. f. How can we use the language model to "correct" user queries ( la Google's "Did you mean") even when none of the words are misspelled? For example, we might want "ricky martian" corrected to "ricky martin". Efficiency does not matter. (Hint: use the lexicon.) #2. Query expansion (10 points) Compare the following two techniques for query expansion: synonym expansion using a thesaurus, and automatic query expansion by common terms/phrases in top-ranked documents.Describe advantages and disadvantages of each approach. (You should aim to provide at least 3 clear points of differentiation.) #3. Rocchio (15 points) If our document vectors are normalized, then they live on the unit hypersphere, and we can reasonably represent a small patch of the unit hypersphere as a flat surface (just as with maps in geography!). The below picture shows such a depiction of an initial query vector, the best hits, and whether they were judged relevant or irrelevant. a. Consider now applying Rocchio's algorithm. For fixed alpha = 10 and zero gamma, consider increasing beta. Redraw the figure above and draw a curve showing the position of the reformulated query for different values of beta. Use the values beta = 0, 10, 100 and label those points on the curve. The points drawn don't have to be in precisely the right place, but they need to be qualitatively correct. b. Suppose now that gamma is set to the same value as beta. Where will the reformulated query be for beta = gamma = 0, 10, 100? c. One way to construct the Rocchio relevance feedback query is as the optimal separator between relevant and non-relevant documents. (Note that this is the theoretical optimal version of the Rocchio query and is different from the "practical" one assumed in parts a and b.) Optimality is defined as follows. Let r1,...,rm be the length-normalized vectors for the relevant documents, and n1,...,nk be thelength-normalized vectors for the non-relevant documents. Then the Rocchio-optimal query ( maximizes the difference between the average correlation of relevant and non-relevant documents:  EMBED Equation.3  where m is the number of relevant and k is the number of non-relevant documents. Given this definition, how is ( computed?Show your work and justify your final answer. #4. Evaluation (10 points) For this exercise, we define the precision-recall graph of a result list as the set of (precision/recall) points, where one precision/recall point is computed for each additional returned document. We will initially define the breakeven point as the point where precision equals recall. a. Can there be more than one breakeven point?If yes, give an example; if not, show why not. b. Some precision recall graphs do not have a breakeven point as defined above. An alternative definition is: The breakeven point is the point with the smallest difference between precision and recall among all those that have larger precision than recall. Write a rough pseudocode program that computes the breakeven point. You should not need more than a few lines. #5. Nave Bayes classifiers (15 points) A common strategy with building Nave Bayes classifiers is to add one parameter whereby we modify the classification function so that the the prior term is raised to some power t.The formula is now:  EMBED Equation.3  a. In general, why might this be a useful thing to do? (Consider the relative impact of the prior class distribution and the multinomial evidence distribution in a Nave Bayes model.) b. In particular, suppose a certain multinomial Nave Bayes classifier generates the following confusion matrix: predicted abcda50262012b3820c0040d0101 (A confusion matrix represents the classifications predicted by a classifier versus the correct classifications. For example, in this matrix there were 108 documents from class a, of which the classifier correctly identified 50.) What changes in performance would you expect in the confusion matrix for different values of t? (Assume the test data has the same distribution as the training data.) #6. Centroids (10 points) Is the centroid of a cluster of normalized vectors normalized? If so, write a short proof. If not, provide a counterexample. #7. K-means (15 points) a. Consider the k-means algorithm with k = 2. The two initial centroids are chosen uniformly at random without replacement from the set of points. Give an example in a two-dimensional plane with the minimal number of points such that we can end up with two different clusterings. No examples with ties are allowed for this exercise. Give the coordinates of the points and specify both possible clusterings. Use Euclidean distance for distance computations. b. What are the probabilities of the different clusterings under uniform random choice?     PAGE  PAGE 2 x documents identified as non-relevant o documents identified as relevant ( original query hits from original search ( o o o x x x x actual !=Al{iuH  H I K b d e  &'jkqr78KLMN 2NO¹ܚCJ jEHUj{:E CJUVaJnHtH jUjEHPJUjA UV jPJUPJ jrPJH*jUmHnHtHu5 PJnHtH5PJnHtH>*5\<<=klhiuvLM   I J K d aa   4CMNPRTVX$If$a$XY[^adgz<ttttt$If$$IfTl rT  4 lafghjlnprsuwy{}~z,tttttz,tttttz,$If$$IfTlrT  4 laf ~56trrrrrrr$$IfTl rT  4 laf$If 5=>abcrtuʽ{llB*CJ0OJQJ^JaJ0phB*CJ(aJ(ph jDB*CJ(aJ(phB*CJ0aJ0phB*OJQJ^JaJ(ph jDB*OJQJ^JaJ(ph B*aJ0phB*OJQJ^JaJ0phB*aJ0mH phsH !B*OJQJ^JaJ0mH phsH 0JmHnHu0J j0JU jU5*=astu7$8$H$h]h&`#$7$8$H$B*CJ0OJQJaJ0phB*CJ0OJQJ^JaJ0ph 1h/ =!"#$%Dd,J  C A? "2j]F@gļFD7`!>]F@gļ8@p( xڕ1L@߽ %N`HY,6`Tv@*0t$H)E DHйCBbRinUAŽww9qt}޻w1@O,%2&- (GZi,6t6Lż?'u1"vcHyDmR-pJAOXb*c3K [=߰HyQ?U}1HL/rL4&>1I\2|qs}?~Nk\^x^t__T<,U+FO֔OoY+{͏]2y'%iެ;? o+vR{;^Y#]L[Y+ "wio Z:; *t)NRn:1h/!Y!^ Y^nPs򄞁|"׌u2fKڗwC-,V|Ddp 0\  c $A? ?#" `2An29;J,7L7`!n29;J,7S!kxcdd``vfb``baV d,FYzP1n:Q! KA?H1 f@=P5< %! vf: KX+ss6q_QXAky8g P.P56j%<[+ՋwCaڳq9[ DAd++&t?xf(/,+k_P10Ypq<7Ll 3o dO $n3܊8 ~uPկ@7gc_f)pj8p_ o1YE鹉&'ϭaɼwYE\`cUTzp9Ĥ\Y\2C D,Ā.`z܃  !"#$'*+,-0./12456789:;<=>?@ABCDEFRoot Entry F0@Q) Data WordDocument"8ObjectPool 0@Q_1100866268 F  Ole CompObjfObjInfo  "#$%'()*+,./013 FMicrosoft Equation 3.0 DS Equation Equation.39q.XII =argmax v,||v||=1 1mv"r i "1k r i  " v"n inEquation Native  _1161460629 F``Ole  CompObj f i  " [] FMicrosoft Equation 3.0 DS Equation Equation.39q C .  &@6 & MathTypeObjInfo OlePres000  Equation Native !1Table3R'Symbol ؟ww wf-2  y Symbol ؟ww wf-2  y2 ySymbol ؟ww wf-2 =y Times New Roman؟ww wf- 2 q positions2  iy2 Pjy2 iy2  ty2  jy2 cCy2 ;cy2 NBTimes New Roman؟ww wf-2 scB2 ixB2 PB2 - cB2 PB2 7cBTimes New Roman؟ww wf-2 )B2 |B2 (B2  )B2  (B 2 argmax`Times New Roman؟ww wf-2 jr &  "Systemf !-NANIj\ c NB ='argmax c j "C P(c j ) t P(x i |c j ) i"positions "Oh+'0 SummaryInformation(&DocumentSummaryInformation8-0CompObj2j  < H T`hpxCS 276a Problem Set #1S 2Louis EisenbergouiouiNormalitauseri18sMicrosoft Word 9.0 @ X8 @3B%@՜.+,0 hp|  6+ @ CS 276a Problem Set #1 Title  FMicrosoft Word Document MSWordDocWord.Document.89q i<@< NormalCJPJ_HaJmH sH tH <A@< Default Paragraph Font*>@* Title$a$5\, @, Footer  !&)@&  Page Number`{~ `{~  8<=klhiuvLMIJKd a    4CMNPRTVXY[^adghjlnprsuwy{}~56=astu0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000@0@0@0 000000000000000 $$$'aXg~7KM::  '!! !/X2$T{Ʌ72$3N u1A3r"7@  ~ (  b 'Q-6  #" ,  BW CNDEFvvS &`g~([yGZ\:p:{vmZa2V<3W0eMzm  : c! 8O vR W W W  ;KPOSt_Xs;F'  b1CLHnNS<9*?", ~p~OwOw-U5C(%&  zz XfSRT@                                       `#" 'w4r  s * ̙m)M,  r  s * ̙ (+  r  s *̙-Y0 r  s *̙ /e 1 r  s *̙ a14 r  s *̙/2 r  s *̙ ,mi/ r  s *̙,+/ r  s *̙]!*(K- lB B 0D9, 9,r  s *̙% 0Q-6 h   S    "`  B S  ?dN$Wt !W t{ &5?H\abisx4:/04<OV  % ' p r Z_ (=E Xc",=wz#%Y[0 % CLPQRSTUVWYZhist~=Idluy3333333333333333333333333333333333*/8tKd 4CLPXYghrs}~6=sutauser;C:\Documents and Settings\tauser\My Documents\louis\ps2.docNPRTVXY[^adghjlnprsuwy{}~"@t@@UnknownGz Times New Roman5Symbol3& z ArialG MS Mincho-3 fg?& Arial Black"1hE&JFB + ,!0d@2QHPCS 276a Problem Set #1Louis Eisenbergtauser00000000000000000000000@0@0@0{0(0){0(0{0(00 @0 @0{0.00 @0 @0{0200 @0 @0 @0{00{00{00{00 {0A0Bd0y0A0 y0A0 y00 y0 0 {0@0AH0y0@0y0@00@0{0E0F0y0E0y0E00@0 @0 $$$'V C b!l!! !,7B P!l! k! l::  '!! !/X2$T{Ʌ2$3N u1A3r"@  ~ (  b 'Q-6  #" ,  BW CNDEFvvS &`g~([yGZ\:p:{vmZa2V<3W0eMzm  : c! 8O vR W W W  ;KPOSt_Xs;F'  b1CLHnNS<9*?", ~p~OwOw-U5C(%&  zz XfSRT@                                       `#" 'w4r  s * ̙m)M,  r  s * ̙ (+  r  s *̙-Y0 r  s *̙ /e 1 r  s *̙ a14 r  s *̙/2 r  s *̙ ,mi/ r  s *̙,+/ r  s *̙]!*(K- lB B 0D9, 9,r  s *̙% 0Q-6 h   S    "`  B S  ?l$Wt !W t pqz..//IIJJLLMMOOPPRRSSUUVVXXYY[[\\^^__aabjmFI_b(*tzjlej#-l or&/3MNPQSTVWYZ\]_`bhm33333333333333333333333333333333333 !",-78B-/HMNPQSTVWYZ\]_`bhmmqiudqiu0qiu2(qiuG)ZQ+qiu).qiueM/qiu0qiu\w@BCm"@  [  @{l@UnknownGz Times New Roman5Symbol3& z Arial]  MS MinchoArial Unicode MS?& Arial Black"1hE&E&9 , ,!4d2QHP ?I2CS 276a Problem Set #1Louis EisenbergLouis Eisenberg