KarWaii (Sammy) Sy
4893129
sammysy@cs.stanford.edu
June 9th, 2002
CS224n: Project Enhanced Google

Problem to be investigated

Most Internet search engines, given a search query, try to find web documents containing the terms listed in the query. However, because the search process treats words purely as semantics-free identifiers, these same search engines are unable to distinguish different word senses, not to mention documents about different topical domains. Frequently, users must manually scan through a long list of interleaved results in order to zoom in on the desired ones.

A solution using an NLP approach

One way to solve the problem is to figure out which one of the different word senses a document assumes. For instance, let's say the term "Michael Jordan" carries two senses, the basketball player and the Berkeley professor. A document containing "Michael Jordan" will use the term to denote either the basketball player or the professor, but probably not both. This is an important property because we can rely on the set of word senses to categorize documents, and thereby solve the above problem. Our goal has now been reduced to defining a function that maps the set of documents containing W to the set of W's word senses.

Like all other NLP problems, however, the guts of this function are completely non-trivial. For example, given a search query, how do we obtain the set of word senses? One possible solution is to use a dictionary like WordNet, but this approach falls apart quickly as soon as we have to deal with non-standard English terminology, acronyms, compound nouns, etc. The space of all human-understandable terms is simply too huge. Moreover, the word senses associated with a term can change over time, and thus maintaining a database of all sets of word senses would require too much work and human intervention. These problematic issues will be addressed more thoroughly in later sections of this report.

My implemented solution involves inserting an extra processing layer between the users and a selected search engine, Google. This layer, which shall be referred to as Project Enhanced Google, or PEG for short, will try to categorize the search results returned by Google and present the categorized results neatly to the user. PEG's success will be judged by how well it categorizes web documents and how much convenience it brings to its users. PEG has a five-stage pipeline:

1. Invoke the parent search engine to obtain a list of "raw" search results.
2. Fetch and purify web documents.
3. Extract features from web documents.
4. Based on the extracted features, cluster the documents into groups by their similarity.
5. Generate HTML for the end-user view.

Detailed discussion of algorithms/methods used

Stage 1: Obtain raw search results

I picked Google as the parent search engine because I like it myself, and because it allows search requests via a SOAP interface. By using the web service directly, I avoid the need to parse the usual HTML output manually, while at the same time maintaining future compatibility with this well-defined interface. The plumbing behind this stage is of little NLP interest, so I will refrain from elaborating further. Do keep in mind, though, that Google limits each licensee of the web service to 1000 requests per day.
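For concreteness, here is a minimal sketch of what the Stage 1 call might look like using the Java client library (googleapis.jar) that shipped with the 2002 Google Web APIs developer kit. The class and method names below are reconstructed from memory of that kit and should be treated as assumptions rather than a verified listing; the license key is a placeholder:

    import com.google.soap.search.GoogleSearch;
    import com.google.soap.search.GoogleSearchFault;
    import com.google.soap.search.GoogleSearchResult;
    import com.google.soap.search.GoogleSearchResultElement;

    // Assumed rendering of Stage 1: ask Google for raw results over SOAP and
    // collect the result URLs for the later fetch-and-purify stages.
    public class RawSearch {
        public static void main(String[] args) throws GoogleSearchFault {
            GoogleSearch search = new GoogleSearch();
            search.setKey("YOUR-LICENSE-KEY");        // placeholder; each key allows 1000 requests per day
            search.setQueryString("Michael Jordan");  // the user's query
            search.setMaxResults(10);                 // the SOAP API returns at most 10 results per call

            GoogleSearchResult result = search.doSearch();
            for (GoogleSearchResultElement e : result.getResultElements()) {
                System.out.println(e.getURL());       // hand these URLs to Stage 2
            }
        }
    }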
The quota is plentiful for obtaining search results, but I can't utilize Google's vast database to, for example, help examine keywords on a word-by-word basis.

Stage 2: Fetch and purify web documents

The network API is one of Java's strong points, so making HTTP requests to fetch web documents is straightforward. Originally I fetched documents sequentially (synchronous calls), but as soon as I started downloading 30 or 40 in one go, the total time spent in this stage alone would exceed many minutes due to long socket timeouts and slow servers. To establish a soft upper bound on the time requirement, I had to multithread this portion of the program to download multiple documents simultaneously, and upgrade to JDK 1.4, which provided an easy way to shorten socket timeouts.

Some non-text document sources are deemed unprocessable by PEG; I exclude URLs with certain file extensions like .zip, .doc, .pdf, etc.

The purification procedure is more interesting. I decided to convert everything to lower case. It's true that I lose all information provided by word capitalization, but in return the problem of data sparseness is alleviated, which I consider a more immediate problem. HTML documents on the web are not what you would normally consider "clean" text. They often contain nasty junk like malformed XML syntax, JavaScript, and symbols; basically, it's often the case that you can't find a single clean paragraph of English sentences on a page. There's not much I can do about that, but I try my best using various ad hoc heuristics, like filtering out tokens without any alphanumeric character, small integers, and stopwords. In addition, I try to purify each token by cropping away outer junk characters (e.g. punctuation, parentheses). The purified text, judged by my eyes, is generally satisfactory and usable for all intents and purposes. Note that I throw away all information about sentence boundaries, although the general word order is undisturbed. I learned that parsing web data is a huge pain in the neck.
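As a concrete illustration, here is a minimal sketch of the kind of token filter just described: lower-casing, cropping outer junk characters, and dropping tokens with no alphanumeric character, small integers, and stopwords. The class name, the tiny stopword list, and the "small integer" cutoff are placeholders of my own, and the code targets a modern JDK rather than the JDK 1.4 used in the project:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch of the token purification heuristics described above.
    public class TokenPurifier {
        // Illustrative placeholder list, not the project's actual stopword list.
        private static final Set<String> STOPWORDS = new HashSet<>(
                Arrays.asList("the", "a", "an", "and", "or", "of", "to", "in", "is", "it"));

        // Returns the cleaned token, or null if the token should be discarded.
        public static String purify(String raw) {
            String t = raw.toLowerCase();                     // drop capitalization information
            t = t.replaceAll("^[^a-z0-9]+|[^a-z0-9]+$", "");  // crop outer junk (punctuation, parentheses)
            if (t.isEmpty()) return null;                     // nothing alphanumeric survived
            if (t.matches("\\d{1,3}")) return null;           // drop small integers (placeholder cutoff)
            if (STOPWORDS.contains(t)) return null;           // drop stopwords
            return t;
        }

        public static List<String> purifyAll(List<String> rawTokens) {
            List<String> out = new ArrayList<>();
            for (String raw : rawTokens) {
                String t = purify(raw);
                if (t != null) out.add(t);
            }
            return out;
        }
    }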
Stage 3: Extract features

Extracting features from web documents turned out to be the core of the project and the most interesting problem from the NLP point of view. A web document contains a rich amount of information, from the text itself, the URL, hyperlinks, and HTML structure to images. This being an NLP project, I decided to treat a web document simply as a sequence of words, without exploiting any of the informative HTML structure (e.g. the title string) or hyperlinks. The quality of the extracted features suffers slightly, but at the same time the result of the project will reflect unambiguously the (in)effectiveness of the NLP algorithms.

Instead of using the raw text as the feature set, I decided to distill it a bit by extracting keywords. The "keywords" of a document are defined to be the set of tokens that appear abnormally frequently in the document compared to the norm (i.e. normal English). Such a definition translates directly into a computable algorithm:

    Let count(w|doc) = { occurrences of word w in document doc }
    Let count(doc)   = { the number of tokens in doc }
    Hence, prob(w|doc) = count(w|doc) / count(doc)

To model "normal English", I went to three years' worth of New York Times articles, from 1994 to 1996. 1.7 gigabytes of text is a ton of data; in the end, I was able to digest approximately 240 million words, 640,000 of which are unique. I have to apply a primitive smoothing adjustment to handle unknown words, which are encountered often on the web:

    Let count(w|language) = { occurrences of word w in the corpus }
    Let count(language)   = { total number of tokens in the corpus }
    prob(w|language) =
        if count(w|language) == 0, return 0.5 / count(language)
        otherwise, return count(w|language) / count(language)

Note that with this simple smoothing, prob(w|language) is no longer a valid probability mass function, but that is inconsequential, as only the proportion is relevant in my calculation. Because count(language) is such a huge number, 0.5 is a decent guess of the number of occurrences of an unknown word.

    For each w in doc
        Let ratio = prob(w|doc) / prob(w|language)
        Let count = count(w|doc)
        w is a keyword of doc if ratio >= THRESHOLD_RATIO and count >= THRESHOLD_COUNT

In English, the algorithm says that a word is a keyword if the following two constraints are satisfied:

1. The word appears abnormally frequently.
2. The word is repeated at least several times in the document.

The constraints make a lot of intuitive sense and reflect the definition of keywords. The tricky part is deciding what to assign to the two threshold values. As alluded to in the introduction, I don't have access to a training set (general documents with labeled keywords), so it's impossible to train the threshold values aside from using my own judgement. This lack of a training set would continue to haunt me later from many other angles; perhaps I should have placed more emphasis on locating a usable training set from the get-go. It appears that THRESHOLD_RATIO=80 and THRESHOLD_COUNT=4 work decently well. Again, these values were picked without any theoretical backing, but most of the time the keyword selection is surprisingly accurate. Rarely does the algorithm fail disastrously.

The decision of whether a word is a keyword is not entirely binary. Some keywords are naturally more relevant than others in a document. For example, all else being equal, a word that appears twice as frequently as another word is obviously more likely to carry more relevance. Similarly, a word with a higher ratio should be considered more relevant. I use a simple scheme to compute this "importance" value:

    importance(keyword) = ratio(keyword) * count(keyword)

Because the ratio term is a quotient, it can easily overwhelm the count value. Therefore, an unknown word may end up grabbing a huge share of importance, which arguably is the correct course of action.

By not using raw text as features, I avoid potential problems comparing two documents with highly skewed lengths. As long as the number of keywords for each document is controlled, comparisons among documents will not run into unnecessary biases towards or against long or short documents. As a protective measure, I keep only the top 10 (MAX_KEYWORDS) most important keywords, to set a hard ceiling on the size of the feature set.
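Putting the pieces together, here is a compact sketch of the unigram keyword extractor described above. Only the formulas and the constants (80, 4, 10) come from the discussion; the class layout, the corpus-count map, and the method names are assumptions of mine:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the unigram keyword criterion: a token is a keyword if its relative
    // frequency in the document is at least THRESHOLD_RATIO times its relative
    // frequency in the background corpus, and it occurs at least THRESHOLD_COUNT times.
    public class KeywordExtractor {
        static final double THRESHOLD_RATIO = 80.0;
        static final int THRESHOLD_COUNT = 4;
        static final int MAX_KEYWORDS = 10;

        private final Map<String, Long> corpusCounts; // count(w|language), e.g. from the NYT corpus
        private final long corpusTotal;               // count(language)

        public KeywordExtractor(Map<String, Long> corpusCounts, long corpusTotal) {
            this.corpusCounts = corpusCounts;
            this.corpusTotal = corpusTotal;
        }

        double probLanguage(String w) {
            long c = corpusCounts.getOrDefault(w, 0L);
            // primitive smoothing: an unseen word gets a pseudo-count of 0.5
            return (c == 0 ? 0.5 : (double) c) / corpusTotal;
        }

        public List<String> extract(List<String> docTokens) {
            if (docTokens.isEmpty()) return new ArrayList<>();
            Map<String, Integer> docCounts = new HashMap<>();
            for (String w : docTokens) docCounts.merge(w, 1, Integer::sum);

            Map<String, Double> importance = new HashMap<>();
            for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
                int count = e.getValue();
                double probDoc = (double) count / docTokens.size();
                double ratio = probDoc / probLanguage(e.getKey());
                if (ratio >= THRESHOLD_RATIO && count >= THRESHOLD_COUNT) {
                    importance.put(e.getKey(), ratio * count); // importance = ratio * count
                }
            }
            // keep only the top MAX_KEYWORDS most important keywords
            List<String> keywords = new ArrayList<>(importance.keySet());
            keywords.sort((a, b) -> Double.compare(importance.get(b), importance.get(a)));
            return keywords.size() > MAX_KEYWORDS ? keywords.subList(0, MAX_KEYWORDS) : keywords;
        }
    }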
So far I have considered locating keywords which are unigrams, but there is a big problem: a keyword often comes packaged as a bigram, trigram, or an n-gram in general. For example, among many other things, names of people are bigrams or trigrams. Also, for many compound terms, the "worth" of the whole is greater than the sum of its individual constituents: "operating system" means something special when used together, for instance. Indeed, the problem of finding collocated keywords is real and crucial. How should I extend the algorithm?

The first question I asked was whether it is feasible to include more information in my language model. I thought about building a bigram model, which would be huge in size due to the sheer number of unique bigrams expected to be found. If the space of unigrams is barely manageable, then the space of bigrams is certainly unthinkable. The problem of data sparseness with a bigram model would simply balloon to the point of being unusable. In fact, the number of false positives for bigram keywords is too great to justify the (huge!) extra space needed for the bigram model. Not only that, a bigram model still doesn't scale to n-gram keywords in general.

Then I started thinking inductively. If I make one plausible assumption, I can collect collocated keywords relatively straightforwardly: an n-gram is a keyword only if all its substrings of length (n-1) are also keywords. For example, if "operating" and "system" have both established themselves as keywords, then "operating system" and "system operating" are both candidates to become bigram keywords. I enforce an additional constraint that the n-gram must also have appeared at least a certain number of times to qualify. (This algorithm was inspired by the A-Priori algorithm for computing frequent item sets.) The recursive algorithm:

    An n-gram w is a keyword if:
        If n == 1:
            return true if w is a unigram keyword; return false otherwise.
        Otherwise:
            Let left  = w[0, len-2]   // (n-1)-gram prefix
            Let right = w[1, len-1]   // (n-1)-gram suffix
            Let count = count(w|doc)
            return true if both left and right are keywords and count >= THRESHOLD_COUNT;
            return false otherwise.

Note that I'm not accounting for the ratio factor which I used in judging unigram keywords. The reason is that prob(n-gram|language) is tough to define with just a unigram-frequency language model. The Naive Bayes assumption, which worked admirably in the word sense disambiguation project, was a complete failure here. Given the immense size of the corpus, prob(unigram|language) was already tiny. Applying the Naive Bayes assumption, a series of successive multiplications, led to microscopic prob(n-gram|language), which in turn led to exceedingly large and useless ratio values.

By assuming that an n-gram is a keyword only if all its substrings are also keywords, I'm oversimplifying the space of n-gram keywords. A serious shortcoming is false negatives on n-gram keywords which contain a common word. For instance, "My Little Lover", a popular band in Japan, will never be collected as a trigram keyword because "my" is a common/stop word, and hence not a keyword by itself. But other than that, many quality compound keywords are successfully assembled from the list of unigram keywords.

The last part of the keyword extraction phase is removing duplicated (n-1)-gram keywords that have been used to construct n-gram keywords. This step avoids having "natural language" and "language processing" as bigram keywords whenever "natural language processing" is a keyword. The removal algorithm is also inductive in nature (and somewhat complex), but a nice property it has is that if "linux operating system" is a keyword, it is still possible for "linux" to be a keyword by itself, provided it has appeared by itself often enough.
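A direct Java rendering of the recursive check above might look like the following. The class name, the token-array representation of n-grams, and the document n-gram count lookup are placeholders, and the subsequent removal of subsumed (n-1)-gram keywords is not shown:

    import java.util.Arrays;
    import java.util.Map;
    import java.util.Set;

    // Sketch of the A-Priori-style rule: an n-gram is a keyword candidate only if both
    // of its (n-1)-gram substrings are keywords and the n-gram itself occurs at least
    // THRESHOLD_COUNT times in the document.
    public class NgramKeywords {
        static final int THRESHOLD_COUNT = 4;

        private final Set<String> unigramKeywords;      // output of the unigram stage
        private final Map<String, Integer> ngramCounts; // count(n-gram|doc), keyed by space-joined tokens

        public NgramKeywords(Set<String> unigramKeywords, Map<String, Integer> ngramCounts) {
            this.unigramKeywords = unigramKeywords;
            this.ngramCounts = ngramCounts;
        }

        public boolean isKeyword(String[] w) {
            if (w.length == 1) return unigramKeywords.contains(w[0]);
            String[] left  = Arrays.copyOfRange(w, 0, w.length - 1); // w[0 .. n-2]
            String[] right = Arrays.copyOfRange(w, 1, w.length);     // w[1 .. n-1]
            int count = ngramCounts.getOrDefault(String.join(" ", w), 0);
            return isKeyword(left) && isKeyword(right) && count >= THRESHOLD_COUNT;
        }
    }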
Although I didn't get enough time to experiment with other keyword extraction algorithms, I thought about using other cues to improve the results. For one, it would be nice if each word in the text had been tagged with its part of speech. A general observation is that keywords are most often nouns or compound nouns; or, to look at it from the opposite direction, it is intuitively rare to find adverbs and adjectives as keywords. The presence of part-of-speech tags would allow an additional constraint. A machine-accessible dictionary with part-of-speech listings could also be used to approximate this "a keyword is a noun" constraint. Secondly, collocated keywords might be "verified" by consulting Google's search database: if an n-gram is found to appear in a large number of web documents, then it is likely a valid keyword. However, the limited number of Google web service requests ruled out this trick.

On well-behaved documents, ones with plenty of well-formed English sentences, the result of keyword extraction is spectacular (more supporting data to follow). However, there are pages out on the web that are entirely Flash-based, image-based, or frame-based. In those cases, no keywords will be found, and such a document becomes featureless. This is a huge problem, as PEG won't be able to do anything about these misbehaving web pages at all.

Stage 4: Automatically cluster web documents by keywords

Given keywords as document features, we can prepare the space of documents for unsupervised clustering. Our goal is two-fold:

1. Group similar documents together.
2. Find well-separated clusters.

Each point in this document space is completely and uniquely identified by its set of keywords. This is a non-Euclidean space without the usual notion of n-dimensional coordinates, but a distance (similarity) function between two documents is definable. There are three types of distance functions that I must define:

1. Keyword-to-Keyword
2. Document-to-Document
3. Cluster-to-Cluster

For the first type of distance function, a straightforward definition is the string equality test: two keywords are equal if and only if the strings match completely. At first glance this definition seems logical, but unfortunately it doesn't handle n-gram keywords fairly. For instance, "linux" and "linux kernel", which are semantically similar, would be unfairly marked as dissimilar. To deal with n-gram keywords, I must allow fractional/partial similarity. Again, I do not know the theoretically optimal distance function in this case, but here's a logical option:

    keywordDistance(A, B)
        Let a be a substring of A
        Let b be a substring of B
            such that a == b, and len(a) = len(b) is maximized
        Let intersect = len(a)
        Let union = len(A) + len(B) - intersect
        return intersect / union

This definition resembles the Jaccard measure. Some examples: distance("a b", "a") = 0.5, distance("a b c", "b c d") = 0.5, distance("a b c", "a c") = 0.25.

Other, more indirect metrics could be used to compare keywords. For example, I could try to use WordNet's synonym listing to approximate the conceptual distance between terms representing similar ideas, like how "ship" is close in meaning to "boat". However, WordNet's database was most of the time inadequate for handling non-standard terms, like "Oracle" as a software company, or n-gram keywords.
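Here is a sketch of keywordDistance as just defined, together with the worked examples. The token-array representation of keywords is my own choice, and despite the name the function returns a similarity in [0, 1] (1.0 = identical), matching the report's convention:

    // Sketch of keywordDistance: the longest run of tokens shared by the two keywords,
    // normalized Jaccard-style over the combined keyword lengths.
    public class KeywordDistance {
        public static double keywordDistance(String[] a, String[] b) {
            int best = 0;
            // longest common contiguous token run, brute force (keywords are only a few tokens long)
            for (int i = 0; i < a.length; i++) {
                for (int j = 0; j < b.length; j++) {
                    int k = 0;
                    while (i + k < a.length && j + k < b.length && a[i + k].equals(b[j + k])) k++;
                    if (k > best) best = k;
                }
            }
            int union = a.length + b.length - best;  // union = len(A) + len(B) - intersect
            return (double) best / union;
        }

        public static void main(String[] args) {
            // the worked examples from the text
            System.out.println(keywordDistance(new String[]{"a", "b"}, new String[]{"a"}));                // 0.5
            System.out.println(keywordDistance(new String[]{"a", "b", "c"}, new String[]{"b", "c", "d"})); // 0.5
            System.out.println(keywordDistance(new String[]{"a", "b", "c"}, new String[]{"a", "c"}));      // 0.25
        }
    }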
Normally, we could use the Jaccard or cosine measure to compute the document-to-document distance function. However, with fractional keyword equality, I have to abandon the Jaccard measure because the set of keywords is no longer a set of distinct objects.

    documentDistance(A, B)
        total = 0
        For each <i, j> where 0 <= i < len(A), 0 <= j < len(B)
            total = total + keywordDistance(A(i), B(j))
        return total / (len(A) * len(B))

Essentially, this computes the average keyword-to-keyword distance over all pairs of keywords in the two sets. It has the property that documentDistance(A, B) == documentDistance(B, A).

The effect of the cluster-to-cluster distance function is more subtle. I've experimented with two definitions. In the first one, I defined the inter-cluster distance to be the distance between the two closest points of the two clusters. The advantage of this definition is that it allows for more "irregularly shaped" clusters, but it's also less stable, leading to ill-separated clusters. The second definition is based on the average document-to-document distance between each pair of documents across the two clusters. The advantage of this definition is robustness against linking two well-separated large clusters, but the shapes of the clusters tend to be "spherical" (i.e. they like to expand uniformly in all directions). Fortunately, due to the relatively few data points (30-50), both definitions perform about identically.

I employ the straightforward hierarchical clustering algorithm to group the documents, but exactly how do you determine the optimal number of clusters? We can compute the quality of a cluster by its "compactness", approximated by its average pairwise document-to-document distance (think: clique). It's a measure of how well separated clusters are from one another. If we define compact(n) to be the micro-average cluster compactness when there are n clusters (as chosen by the hierarchical clustering algorithm), then it can be shown to be an increasing function of n (it decreases as n approaches 1): when n == number of documents, compact(n) = 1.0, but as soon as we start grouping different documents together, the overall compactness decreases, approaching 0.0. Naturally, we want clusters to be as tightly packed as possible, but it can also be argued that we prefer as few clusters as necessary. My hope is that these two competing qualities will single out an optimal number of clusters. I do not know for certain how to model "as few clusters as necessary", and without a training set, it's tough to guess thresholds. In the end, after taking PEG for a few manual test drives, I decided to use a normal distribution with a mean of 6.0 and a variance of 5.0 (in units of clusters). These values are arbitrary, but nevertheless not completely random.

Afterwards, each cluster is labeled by its prominent keywords, where prominence is measured by the number of documents in the cluster containing a particular keyword. I choose the top four most prominent keywords so as to control the length of a cluster label.

Regrettably, this is the part of the project I'm most dissatisfied with: I have to determine the optimal number of clusters without any reasonable justification. There has to be a better, more logically sound way to compute this function, but unfortunately I ran out of time. An EM approach may be able to group and label clusters simultaneously.
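The report does not spell out exactly how the compactness curve and the normal-distribution preference are combined, so the following is only one plausible reading, offered as an assumption: score each candidate cluster count n by compact(n) weighted by an unnormalized Gaussian with mean 6.0 and variance 5.0, and keep the n with the highest score.

    // One plausible way to pick the number of clusters, consistent with the description
    // above but not confirmed by the report: balance micro-average compactness against a
    // normal-distribution preference for roughly six clusters.
    public class ClusterCountSelector {
        static final double MEAN = 6.0;
        static final double VARIANCE = 5.0;

        // Unnormalized Gaussian preference for having n clusters.
        static double prior(int n) {
            double d = n - MEAN;
            return Math.exp(-d * d / (2.0 * VARIANCE));
        }

        // compact[n] = micro-average cluster compactness when the hierarchical clustering
        // is cut into n clusters (so compact[numDocs] == 1.0); the array must have
        // length numDocs + 1, with index 0 unused.
        public static int chooseNumClusters(double[] compact, int numDocs) {
            int bestN = 1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int n = 1; n <= numDocs; n++) {
                double score = compact[n] * prior(n);
                if (score > bestScore) {
                    bestScore = score;
                    bestN = n;
                }
            }
            return bestN;
        }
    }

Under such a scheme the prior merely nudges the choice toward roughly half a dozen clusters; the compactness term can still override it when the data clearly separate into more or fewer groups.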
Stage 5: Generate HTML output for the end-user view

With the clustered groups, I'm finally ready to generate HTML code for the search results page. This is straightforward, but there are a few features worth mentioning. First, because PEG goes out of its way to download the web documents pointed to by the search results, it is able to show precisely which URLs are unavailable at the moment (e.g. 404 and 403 errors). Secondly, the result page includes the list of keywords associated with each document, which acts as a decent summary of the content ahead.

Results and performance

The problem at hand is too complex for a machine to automatically decide whether a particular cluster configuration is "good". Without a usable training set, tuning the system and experimenting with new ideas quickly became an exhausting task of testing on a few hopefully representative search queries and eyeballing the results manually. Certainly, this is not the best approach to judging an NLP system.

We start with the performance of the keyword extraction engine. With the default parameters, I computed the sets of keywords, ordered by importance, for the following sites (as of today):

threshold=80, count=4
    http://www.berkeley.edu/ | uc, berkeley, campus, sites, student, students
    http://www.stanford.edu/ | stanford
    http://www.mit.edu/ | mit
    http://www.cnn.com/ | u.s, cnn.com, cnn, video
    http://www.stanford.edu/class/cs224n/ | java, natural language processing, language processing, processing, assignments, assignment, gates, section, programming, please
    http://nlp.stanford.edu/~manning/ | +1 650, nlp, probabilistic, ling, linguistics, grammar, natural language processing, christopher manning, language processing, manning

If I reduce the ratio threshold to 40, I obtain the following additional keywords:

threshold=40, count=4
    http://www.berkeley.edu/ | (none)
    http://www.stanford.edu/ | (none)
    http://www.mit.edu/ | news
    http://www.cnn.com/ | sports
    http://www.stanford.edu/class/cs224n/ | final project
    http://nlp.stanford.edu/~manning/ | (none)

As you can see, there is not a whole lot of difference despite a significant reduction in the ratio threshold. This is good news, because it means good keywords tend to be good at standing out from the norm. Let's reduce the count threshold, while returning the ratio threshold to its original value:

threshold=80, count=3
    http://www.berkeley.edu/ | uc, berkeley, cal, campus, extension, sites, sessions, student, students, june
    http://www.stanford.edu/ | alumni, stanford university, stanford
    http://www.mit.edu/ | mit, events, programs
    http://www.cnn.com/ | u.s, cnn.com, instyle.com, sci-tech, jagger, cnn, video, die, gas
    http://www.stanford.edu/class/cs224n/ | ling, java, natural language processing, language processing, processing, directory, assignments, assignment, manning, gates
    http://nlp.stanford.edu/~manning/ | +1 650, nlp, 2002, syntactic, probabilistic, ling, science linguistics, linguistics, computational, grammar

threshold=80, count=2
    http://www.berkeley.edu/ | 2002, a-z, sessions uc extension, finder map, uc, undergraduate, berkeley, cal student, web sites, libraries
    http://www.stanford.edu/ | directories, alumni, stanford university, stanford
    http://www.mit.edu/ | mit web, search mit, mit, directory, events calendar, faculty staff, tech, campus, offices programs, massachusetts
    http://www.cnn.com/ | u.s, cnn.com, instyle.com, sci-tech, time.com, 2002, jagger knighted, mick jagger, in-depth, swamp nets
    http://www.stanford.edu/class/cs224n/ | cs 224n ling 237 natural language processing, eval.pl script, e.g., hyponym, wmhome, jwordnet, stdin", ling, java, computational
    http://nlp.stanford.edu/~manning/ | +1 650, nlp, 2002, syntactic, probabilistic models, probabilistic, hinrich sch tze, 2001, 2000, 1999

A lot more unimportant keywords crop up when the count threshold is set too low.
Let's increase the count threshold:

threshold=80, count=5
    http://www.berkeley.edu/ | student, students
    http://www.stanford.edu/ | stanford
    http://www.mit.edu/ | mit
    http://www.cnn.com/ | cnn, video
    http://www.stanford.edu/class/cs224n/ | language processing, processing, section, natural
    http://nlp.stanford.edu/~manning/ | probabilistic, linguistics, natural language processing, christopher manning, language processing, manning, processing, text, language, natural

It seems a count threshold of 5 (or higher) is too restrictive; too many otherwise good and valid keywords are discarded.

A systematic method of evaluating the clustering results is not immediately obvious, but I can demonstrate the system's effectiveness with a few examples. Let's try "Berkeley":

query=Berkeley
    # | cluster name | size | correct | percentage
    1 | students | 2 | 2 | 100%
    2 | seti@home, june 2002, 2002, sponsors | 1 | 1 | 100%
    3 | library, libraries | 3 | 3 | 100%
    4 | library, uc, collections, web | 1 | 1 | 100%
    5 | berkeley, information | 8 | 7 | 87.5%
    6 | seismological laboratory, earthquake information, earthquake, berkeley | 1 | 1 | 100%
    7 | web | 2 | 2 | 100%
    8 | {unrelated/lonely} | 12 | - | -

Of course, it is easy to achieve a high correct percentage if the cluster size is small and the cluster label is general enough. The documents in the {unrelated/lonely} cluster are either featureless or unrelated to all other documents in the pool. Unfortunately, although the quality of the individual clusters is decent, we would have preferred fewer clusters. For example, the "students" and "berkeley, information" clusters should form one larger cluster. Also, the number of unsorted documents falling into the {unrelated} cluster is substantial.

I mentioned earlier that I evaluate cluster "compactness" by computing the average internal distance in a clique-of-documents configuration, and that the compactness function decreases as the number of clusters approaches one. Here's an illustration:

    [embedded chart: micro-average compactness compact(n) versus the number of clusters n]

Note that this function doesn't exhibit a "cliff", or a sudden drop, as n becomes small.

Let's search for "doom":

query=doom
    # | cluster name | size | correct | percentage
    1 | doom, doom iii, 2002, iii | 16 | 14 | 87.5%
    2 | science fiction fantasy horror, planet doom, sites, science fiction | 1 | 1 | 100%
    3 | site | (bad category)
    4 | sgi doom, sgi version, version sgixdoom | 1 | 1 | 100%
    5 | 32+32kwb asus tx97-e 430tx.... 2000, pv, pci | 1 | 1 | 100%
    6 | armdeu 1.17, armdeu, wadptr 2.3, wadptr | 1 | 1 | 100%
    7 | linux, 10%, 2.5%, freebsd | 1 | 1 | 100%
    8 | {unrelated/lonely} | 7 | - | -

The cluster labeled "site" is a bad category because the label doesn't mean anything. It would have been preferable if the "sgi doom" and "32+32kwb" (a benchmarking site) clusters had been merged with the main "doom" (the id Software game) cluster. PEG was able to single out a cluster for a horror story, and another for an essay about Linux being Microsoft's impending doom.

Related work in the literature

Several text clustering systems have been developed in the past. Xerox PARC's Scatter/Gather implemented a similar system, but it didn't try to distill document features (i.e. keywords), and it used a straightforward cosine measure to define inter-document distances. Andrei Broder of DEC has also done work on defining an inter-document distance function specifically targeted at web documents. It used the mathematical notions of "resemblance" and "containment" (a symmetric versus an asymmetric relationship), the latter of which PEG does not attempt to account for.
EMBED Excel.Sheet.8   EMBED Excel.Sheet.8   EMBED Excel.Sheet.8  FGH_`gip ""<!!_''d12>?CZDG6HSTjjnnnnnnpppppppppppppppppþ jfUj@ UV jfUj4@ UV jUj@ UV jUOJQJ5 6OJQJ H*OJQJ0JjOJQJUjOJQJUOJQJ@ap~& ' ,Q & F$ap~& ' ,Q"?@!"<ļ|yvp  56   /  T  t    YZ+"?@!"< R&fz< R&fz!!/![!v!!!!!>"c""$$%%^'_'''f(g(**,,..|yvsC  fgij    PQm45m<=?@,z!!/![!v!!!!!>"c""$$%%^'_'''f(g(** & F h*,,..c1d1z111111!2;2W22222!5"50717K9L9==>p.c1d1z111111!2;2W22222!5"50717K9L9==>>>????AA0ACADACCCCCDDBD[D\DDDFFG       '   H  WXPQl,9QxB2>>>????AA0ACADACCCCCDDBD[D\DDDFF & F h & F hFGGGGGH7H8HHH+L,LzQ{Q~RRSSSTVVVVV VWWyXGGGGGH7H8HHH+L,LzQ{Q~RRSSSTVVVVV VWWyXzXXXXXXXXYYY)YBYCYiYYY ZZZZZZ[[[ [9[:[;[O[T[U[i[p[q[[[[[[[[] ]]8]]]]]]]]]^I^J^p^^^_|_}_~___` `yXzXXXXXXXXYYY)YBYCYiYYY ZZZZZZ[[[d$$l $[ [9[:[;[O[T[U[i[p[q[[[[[[[[] ]]8]]]]]]lhp$$$l ]]]]^I^J^p^^^_|_}_~___``4`g`h`|```alama\<4 $$l $``4`g`h`|```alamaabb7bbbbcc5cNc`caczccccccccccdd8ddddEeFeeeee ff%f8f9f;fDfFfHfMfNfPfufwfyf~fffffffffffffffffffff4g6g8g=g>g@gDgFgHgMgNgPgcgfggghgig bmaabb7bbbbcc5cNc`caczccccccccccdd8d dd$$l $8ddddEeFeeeee ff%f8f9f;fDfFfHfMfNfPfufT$$ljm $$l $ufwfyf~fffffffffffffffffffff4g6g8g|ߨߌL$$ljm $8g=g>g@gDgFgHgMgNgPgcgfggghgigjgzi{ijjjjjjjk@l$$ljm $igjgzi{ijjjjjjjk kk#k1kDkEkGkakdkgkmknkpkkkkkkkkkkkkkklll l l8l:ln?nRoSo~pppppppp Tk kk#k1kDkEkGkakdkgkmknkpkkkkkkkkkkkkkkߤ@\$$ljm $klll l l8l:ln?nRoSo~pppppppppppp$$ljm $'c3==L%UL%אOjۙiomZca<>ED:xy>̬q}.n88':3'݃xGy }L Aڇ0dN*q}([ӝr|)Ta*ױӎh{φR yFq͹Hg` dP(Եi|7>z5PKarWaii (Sammy) Sy 4893129  HYPERLINK mailto:sammysy@cs.stanford.edu sammysy@cs.stanford.edu June 9th, 2002 CS224n: Project Enhanced Google Problem to be investigated Most Internet search engines, given a search query, try to find web documents containing the terms listed in the query. However, because the search process treats words purely as semantic-less identifiers, these same search engin/ =!"#$%`!cw~Qk +$#GJ*xZhU\5dṕ̠!2[mLcC(IfK$X, )mbdH-dVV{}}~y~ιxb@d~P Ы@cjB&IDi)ԟ'Q8OdYԞ|<a=Sw(͡\ڌfn}/!yt* I2SQrRq/%xIl [Ln_@x4:^SI?6aƮ؝ 2юAR+x}Ss|;"k /Wei٧|XK;8TA־,VӹڧُvG3L+9!RyfA^{ט )^጗R>-]O>nAN 10~v/0VɩťTږxbUx}|}1BЯ֢bij89㪁I{L_SuFb~?婩-4mgፌ?mpMlEN#sQyCoxd(kϴIgV~~fO|`|O@mkj'5LNeTZXy:O]Q13= }s*5Q#]Rp~oa艷=څ=µOϾwᤰև;pXrs_HX+l+N bqIKvl>LQYq\1Zif$S(eY& 3-hxUV*-iW{f\+~ˉا(tg3 P[JW]攩T֨|I-Y MъAN_\٪4ە:`J+a1~?riMϞu)yNog<FFCreaEsQRc`;Xc50X c%2VȘ_^30ûdRja&ƶ3g(cG;.0v]e6c=B5f U4mZу6[(*ZOlv{AP; UTua){& UT |bƧʞH6Up_ଲGREK.,R;M tϐ+*TݑoHm A𗲻RE ؆;춀TǦj 6LQ5 UDz))AVPvK@hgSH4b W1 UϴZ& F# HtmVxE/+> U׶*QD1 eݾMukQT'&Qw1V-e\XOp<ᵛ6vgUr~f%0Ey[H>ZyՕyt7 ; cډu5,P̕llu57ouw1 E x5Okkqn1hf>D-j ӺhYTt7+#Fřr&gaq8SsZYbqyƹřjj _j(L vqP Pݺ78I6Ŋhmy._/#$_3pJCoِ12aO sU>rd.s|.kh7(`!~FJO~ #GJ*Lx͚]lTEs- }AӖ( IMu#//'RY$ׁ҆H?8}[Sm| Gs*904 ʎ۹z;b-xҏ߳RQq4tv3nU<'/[eU+9*mcK3lTؖ{uBME3%tD}|"*4alԵmNL%&Kyޔ~>؃it|Ε:)ʖaee0kaČr kՏ?8~ʥu8{>KurѪDM\SŐc/Z7(d%ׂab6lYYY fM̚q 0wm`ά^f{fvif]dv f72esbVǬjfmd.f+jV0z]xASe'm_'EXcc#uLuW3r1i̴\ ̴\ jfZs53-˜fEs5kyJ˛iN/kfZZLu^Lydv^u[[j͢{fv5f'U<&[X45FWT=6z1 oѭHH^Qڼei|~%/ Wq 'A#8䥐9l<._ 甼WؼeEY3:hyu[ pK!yE+ͫ1]1QHH^Zn*%qӕ<Wؼ'c 0W!yEͫ+i2%CvTWk6юWۼ"ZUƫhS5$º-Hc\?^M.Ǿfiί7VA ߵ?xTʆd ޕMw<ܚ6$wq}{{H7$h 㝉c?ZQ?`*ߡQ} ՉmS ycaLsuz|Vksa"\* ٬HCt+8@i , s, E[, ;<#IRuUꃬc A&5YǖT6o~g+I]yKυvd3;Զmvv;:kwTZzUfB55TG0~|g1ېes are unable to distinguish different word senses, not mentioning documents about different topical domains. Frequently, users must manually scan through a long list of interleaved results in order to zoom in onto the desired ones. 
A solution using an DyK sammysy@cs.stanford.eduyK >mailto:sammysy@cs.stanford.eduDd 0  # A2cw~Qk +2 `!cw~Qk +$#GJ*xZhU\5dṕ̠!2[mLcC(IfK$X, )mbdH-dVV{}}~y~ιxb@d~P Ы@cjB&IDi)ԟ'Q8OdYԞ|<a=Sw(͡\ڌfn}/!yt* I2SQrRq/%xIl [Ln_@x4:^SI?6aƮ؝ 2юAR+x}Ss|;"k /Wei٧|XK;8TA־,VӹڧُvG3L+9!RyfA^{ט )^጗R>-]O>nAN 10~v/0VɩťTږxbUx}|}1BЯ֢bij89㪁I{L_SuFb~?婩-4mgፌ?mpMlEN#sQyCoxd(kϴIgV~~fO|`|O@mkj'5LNeTZXy:O]Q13= }s*5Q#]Rp~oa艷=څ=µOϾwᤰև;pXrs_HX+l+N bqIKvl>LQYq\1Zif$S(eY& 3-hxUV*-iW{f\+~ˉا(tg3 P[JW]攩T֨|I-Y MъAN_\٪4ە:`J+a1~?riMϞu)yNog<FFCreaEsQRc`;Xc50X c%2VȘ_^30ûdRja&ƶ3g(cG;.0v]e6c=B5f U4mZу6[(*ZOlv{AP; UTua){& UT |bƧʞH6Up_ଲGREK.,R;M tϐ+*TݑoHm A𗲻RE ؆;춀TǦj 6LQ5 UDz))AVPvK@hgSH4b W1 UϴZ& F# HtmVxE/+> U׶*QD1 eݾMukQT'&Qw1V-e\XOp<ᵛ6vgUr~f%0Ey[H>ZyՕyt7 ; cډu5,P̕llu57ouw1 E x5Okkqn1hf>D-j ӺhYTt7+#Fřr&gaq8SsZYbqyƹřjj _j(L vqP PNLP approach One way to solve the problem is to figure out which one of the different word senses that a document assumes. For instance, let's say the term "Michael Jordan" carries two senses, the basketball player and the Berkeley professor. A document containing "Michael Jordan" will use the term to denote either the basketball player or the professor, but probably not both. This is an important property because we can rely on the set of word senses to categorize documents, and thereby solving the above problem. Our goal now has been reduced to defining such a function that maps the set of documents containing W to the set of W's word senses. Like all other NLP problems, however, the gut of this function is completely non-trivial. For example, given a search query, how do we obtain the set of word senses? One possible solution is to use a dictionary like WordNet, but this approach falls apart quickly as soon as we have to deal with non-standard English terminologies, acronyms, compound nouns, etc. The space of all human-understandable terms is simply too huge. Moreover, word senses associated with a term can change over time, and thus maintaining a database of all sets of word senses would require too much work and human intervention. These problematic issues will be addressed more thoroughly in later sections of this report. My implemented solution involves inserting an extra processing layer between the users and a selected search engine, Google. This layer, which shall be referred to as Project Enhanced Google or PEG for short, will try to categorize search results returned by Google and present the categorized results neatly to the user. PEG's success will be judged by how well it catagorizes web documents, and how much convenience it brings to its users. PEG has a five-stage pipeline: Invoke the parent search engine to obtain a list of "raw" search results. Fetch and purify web documents. Extract features from web documents. Based on the extracted features, cluster the documents into groups by their similarity. Generate HTML for the end-user view. Detailed discussion of algorithms/method used Stage 1: Obtain raw search results I picked Google as the parent search engine because I like it myself, and because it allows search requests via a SOAP interface. By using the web service directly, I save the need to parse the usual HTML output manually while at the same time maintain future compatibility with this well defined interface. The plumbing behind this stage is of little NLP interest, and so I will reframe from elaborating further. Do keep in mind though that Google limits each licensee of the web service to 1000 requests per day. 
The quota is plentiful for obtaining search results, but I can't utilize Google's vast database to, for example, help examine keywords on a word-by-word basis. Stage 2: Fetch and purify web documents The Network API in Java is one of its strong points, so making HTTP requests to fetch web documents is straightforward. Originally, I was fetching documents sequentially (synchronous calls), but as soon as I started downloading 30 or 40 in one go, the total time spent just in this stage would exceed many minutes due to long socket timeouts and slow servers. To establash a soft upperbound in time requirement, I had to multithread this portion of the program to download multiple documents simultaneously, and upgrade to JDK 1.4 which provided an easy way to shorten socket timeouts. Some non-text document sources are deemed unprocessible by PEG. I exclude URLs with certain file extensions like .zip, .doc, .pdf, etc. The purification procedure is more interesting. I decided to convert everything to lower case. It's true that I will lose all information provided by word capitalization, but in return, the problem of data sparseness is alleviated, which I consider a more immediate problem. HTML documents on the web are not what you normally consider "clean" text. They often contain nasty junk like malformed XML syntax, javascript, symbols; basically, it's often the case that you don't find any clean paragraph of English sentences on a page. There's not much I can do about them, but I try my best using various adhoc heuristics, like filtering out tokens without any alphanumeric character, small integers, and stopwords. In addition, I try to purify each token by cropping away outer junk characters (e.g: punctuations, parenthesis). The purified text, judged by my eyes, is generally satisfactory and usable for all intents and purposes. Note that I throw away all information about sentence boundary, although the general word order is undisturbed. I learned that parsing web data is a huge pain in the neck. Stage 3: Extract features Extracting features from web documents turned out to be the core of the project and the most interesting problem from the NLP point of view. A web document contains a rich amount of information, from the text itself, URL, hyperlinks, HTML structure, to images. Being an NLP project, I decided to treat a web document simply as a sequence of words, without exploiting any of the informative HTML structure (e.g: title string) or hyperlinks. The quality of extracted features will suffer slightly, but at the same time the result of the project will reflect unambiguously the (in)effectiveness of NLP algorithms. Instead of using raw text as the feature set, I decided to distill it a bit by extracting keywords. "Keywords" of a document is defined to be a set of tokens that appear abnormally frequently in the document compared to the norm (i.e: normal English). Such a definition translates directly to a computable algorithm. Let count(w|doc) = { occurances of word w in document doc } Let count(doc) = { the number of tokens in doc } Hence, prob(w|doc) = count(w|doc) / count(doc) To model "normal English", I went to three years worth of New York Times articles from 1994 to 1996. 1.7 gigabytes of text is a ton of data, and in the end, I was able to digest approximately 240 million words, 640,000 of which are unique. I have to apply a primitive smoothing adjustment to handle unknown words, which are encounted often on the web. 
Let count(w|langauge) = { occurances of word w in the corpus } Let count(language) = { total number of tokens in the corpus } prob(w|language) = if count(w|language)==0, then return 0.5 / count(language) otherwise, return count(w|language) / count(language) Note that with this simple smoothing, prob(w|language) is no longer a valid probability mass function, but that is inconsequential as only the proportion is relevant in my calculation. Because count(language) is such a huge number, 0.5 is a decent guess of the number of occurances of unknown words. For each w in doc Let ratio = prob(w|doc) / prob(w|language) Let count = count(w|doc) w is a keyword of doc if ratio >= THRESHOLD_RATIO and count >= THRESHOLD_COUNT In English, the algorithm describes that a word is a keyword if the following two contraints are satisifed: The word appears abnormally frequent The word has at least repeated several times in the document The constraints make a lot of intuitive sense, and reflect the definition of keywords. The tricky part is deciding what to assign the two threshold values. As alluded in the introduction, I don't have access to a training set (general documents with labeled keywords), so it's impossible to train for threshold values aside from using my own judgement. This lack of a training set would continue to haunt me later from many other angles. Perhaps I should have placed more emphasis in locating a usable training set from the get go. It appears that THRESHOLD_RATIO=80 and THRESHOLD_COUNT=4 work decently well. Again, these values were picked without any theoretical backing, but most of the time the keyword selection is surprisingly accurate. Rarely does the algorithm fail disastrously. The decision of whether a word is a keyword is not entirely binary. Some keywords are naturally more relevant than others in a document. For example, given all else equal, a word that appears twice as frequently as the other word is obviously more likely to carry more relevance. Similarly, a word with a higher ratio should be considered more relevant. I use a simple scheme in computing this "importance" value. importance(keyword) = ratio(keyword) * count(keyword) Because the ratio equation is divisional, it can easily overwhelm the count value. Therefore, an unknown word may end up grabbing a huge share of importance, which arguably is the correct course of action. By not using raw text as features, I avoid potential problems comparing two documents with highly skewed lengths. As long as the number of keywords for each document is controlled, comparison among documents will not run into unnecessary biases towards or against long or short documents. As a protective measure, I keep only the top 10 (MAX_KEYWORDS) most important keywords to set a hard ceiling on the size of the feature set. So far, I have considered locating keywords which are unigrams, but there is a big problem: a keyword often comes packaged as a bigram, trigram, or an n-gram in general. For example, among many other things, names of people are bigrams or trigrams. Also, for many compound terms, the "worth" of the whole is greater than the sum of its individual constituents. "Operating System" means something special when used together, for instance. Indeed, the problem of finding collocated keywords is real and crucial. How should I extend the algorithm? The first question I asked was whether it is feasible to include more information with my language model. 
I thought about building a bigram model, which would be huge in size due to sheer number of unique bigrams expected to be found. If the space of unigram is barely managable, then the space of bigrams is certainly unthinkable. The problem of data sparseness with a bigram model will simply balloon to the point of being unusable. In fact, the number of false positives for bigram keywords is too great to justify the extra space (HUGE!) needed for the bigram model. Not only that, a bigram model still doesn't scale well to deal with n-gram keywords regardless. Then I started thinking inductively. If I make one plausible assumption, I can collect collocated keywords relatively straightforwardly: an n-gram is a keyword only if all its substrings of length (n-1) are also keywords. For example, if "operating" and "system" have both established themselves as keywords, then "operating system" and "system operating" are both candidates to become bigram keywords. I will enforce an additional constraint where the n-gram keyword must also have appeared at least a certain number of times to qualify. By the way, this algorithm was inspired by the A-Priori algorithm for computing frequent item sets. Recursive algorithm: An n-gram w is a keyword if: If (n==1) then, return true if w is a unigram keyword; return false otherwise. Otherwise, Let left = w[0,len-2] // substring Let right = w[1,len-1] Let count = count(w|doc) return true if both left and right are keywords and count >= THRESHOLD_COUNT return false otherwise. Note that I'm not accounting for the ratio factor which I used in judging unigram keywords. The reason is that prob(n-gram|language) is a tough one to define with just a unigram frequency language model. The Naive Bayes Assumption, which worked admirably in the word sense disambiguation project, was a complete failure here. Given the immense size of the corpus, prob(unigram|language) was already tiny. Applying the Naive Bayes Assumption, a series of successive multiplications, led to microscopic prob(n-gram|language), which in turn led to exceeding large and useless ratio values. By making the assumption that an n-gram is a keyword only if all its substrings are also keywords, I'm oversimplying the space of n-gram keywords. A serious shortcoming is true negatives on n-gram keywords which contain a common word. For instance, "My Little Lover", a popular band in Japan, will never be collected as a trigram keyword because "my" is a common/stop word, and hence not a keyword by itself. But other than that, many quality compound keywords are successfully assembled from the list of unigram keywords. The last part of the keyword extraction phase is removing duplicated (n-1)-gram keywords that have been used to construct n-gram keywords. This step is to avoid having "natural language" and "language processing" as bigram keywords whenever "natural language processing" is a keyword. The removal algorithm is also inductive in nature (and somewhat complex), but a nice property it has is that if "linux operating system" is a keyword, it is still possible for "linux" to be a keyword by itself provided it's appeared by itself enough. Although I didn't get enough time to experiment with other keyword extraction algorithms, I thought about using other cues to improve results. For one, it would be nice if each word in the text had been tagged for its part of speech. A general observation is that keywords are most often nouns or compound nouns. 
Or to think of this from the opposite direction, it is rare to find adverbs and adjectives as keywords, intuitively. The presence of part of speech tags would allow an additional constraint. Actually, an machine-accessible dictionary with part of speech listing for a word could be used to approximate this "keyword is a noun" constraint as well. Secondly, collocated keywords maybe "verified" by consulting Google's search database. If an n-gram is found to appear in a large number of web documents, then likely such an n-gram is a valid keyword. However, the limited number of Google web service requests disallowed this trick. On well-behaving documents, ones with plenty of well-formed English sentences, the result of keyword extraction is spectacular (more supporting data to follow). However, there are pages out on the web that are entirely Flash-based, image-based, or frame-based. In those cases, no keywords will be found, and such a document will become featureless. This is a huge problem, as PEG won't be able to do anything about these misbehaving webpages at all. Stage 4: Automatically cluster web documents by keywords Given keywords as document features, we can prepare the space of documents for unsupervised clustering. Our goal is two-fold: Group similar documents together Find well-separated clusters Each point in this document space is and completely and uniquely identified by its set of keywords. This is a non-Euclidean space without the normal notion of n-dimensional coordinates, but a distance (similarity) function between two documents is definable. There are three types of distance functions that I must define: Keyword-to-Keyword Document-to-Document Cluster-to-Cluster For the first type of distance functions, a straightforward definition is the string equality test. Two keywords are equal if and only if the strings match completely. At first glance, this definition seems logical, but unfortunately, it doesn't handle n-gram keywords fairly. For instance, "linux" and "linux kernel", which is semantically similar, will be unfairly marked as dissimilar. To deal with n-gram keywords, I must allow fractional/partial similarity. Again, I do not know the theoretically optimal distance function in this case, but here's a logical option: keywordDistance(A,B) Let a be a substring of A Let b be a substring of B such that a == b, and len(a)=len(b) is maximized Let intersect = len(a) Let union = len(A) + len(B)  intersect return intersect / union This definition resembles the Jaccard measure. Some examples: distance("a b", "a") = 0.5, distance("a b c", "b c d") = 0.5, distance("a b c", "a c") = 0.33. Other more indirect metrics may be used to compare keywords. For example, I could try to use WordNet's synonym listing to approximate the conceptual distance between terms representing similar ideas, like how "ship" is close the meaning of "boat". However, WordNet's database most of the time was inadaquate in handling non-standard terms, like "Oracle" as a software company., or n-gram keywords. Normally, we can use the Jaccard or Cosine measure to compute document-to-document distance function. However, with fractional keyword equality, I have to abandon the Jaccard measure because the set of keywords is no longer a set of distinct objects. 
documentDistance(A,B) total = 0 For each <i,j> where 0 <= i < len(A), 0 <= j < len(B) total = total + keywordDistance(A(i), B(j)) return total / (len(A) * len(B)) Essentially, it's computing the average word-to-word distance for each pairs of keywords in the two sets. It has the property that documentDistance(A,B) == documentDistance(B,A). The effect of the Cluster-to-Cluster distance function is more subtle. I've experimented with two definitions. In the first one, I defined the inter-cluster instance to be the distance of the two closest points between the two clusters. The advantage of this definition is that it allows for more "irregularly-shaped" clusters, but it's also less stable, leading to ill-separated clusters. The second definition is based on the average document-to-document distance between each pair of documents in the two clusters. The advantage of this definiton is robustness against linking two well separated large clusters, but the shapes of the clusters tend to be "spherical" (i.e: likes to expand uniformly in all direction). Fortunately, due to the relatively few data points (30-50), both definitions perform about identically. I employ the straightforward Hierarchical Clustering algorithm to group the documents, but exactly how do you determine the optimal number of clusters? We can compute the quality of a cluster by its "compactness", approximated by its average pair-wise document-to-document distances (think: clique). It's a measure of how well-separated clusters are to one another. If we define compact(n) to be the micro-average cluster compactness for having n clusters (chosen by the hierachical clustering algorithm), then it can be proven that it's an increasing function (decreases as n approaches 1); when n == number of documents, compact(n) = 1.0, but as soon as we start grouping different documents together, the overall compactness decreases, approaching 0.0. Naturally, we want clusters to be as tightly packed as possible, but it can also be argued to we prefer as few necessary groups as possible. My hope is that these two competing qualities will single out an optimal number of clusters. I do not know for certain how to model "as few necessary groups as possible", and without a training set, it's tough to guess thresholds. In the end after taking PEG for a few manual test drives, I decided to use a normal distribution, with a mean of 6.0 and variance of 5.0 (in units of clusters). These are arbitrary, but nevertheless not completely random. Afterwards, each cluster ݺ78I6Ŋhmy._/#$_3pJCoِ12aO sU>rd.s|.kh7(.Dd 0  # A2FJO~v`!~FJO~ #GJ*Lx͚]lTEs- }AӖ( IMu#//'RY$ׁ҆H?8}[Sm| Gs*904 ʎ۹z;b-xҏ߳RQq4tv3nU<'/[eU+9*mcK3lTؖ{uBME3%tD}|"*4alԵmNL%&Kyޔ~>؃it|Ε:)ʖaee0kaČr kՏ?8~ʥu8{>KurѪDM\SŐc/Z7(d%ׂab6lYYY fM̚q 0wm`ά^f{fvif]dv f72esbVǬjfmd.f+jV0z]xASe'm_'EXcc#uLuW3r1i̴\ ̴\ jfZs53-˜fEs5kyJ˛iN/kfZZLu^Lydv^u[[j͢{fv5f'U<&[X45FWT=6z1 oѭHH^Qڼei|~%/ Wq 'A#8䥐9l<._ 甼WؼeEY3:hyu[ pK!yE+ͫ1]1QHH^Zn*%qӕ<Wؼ'c 0W!yEͫ+i2%CvTWk6юWۼ"ZUƫhS5$º-Hc\?^M.Ǿfiί7VA ߵ?xTʆd ޕMw<ܚ6$wq}{{H7$h 㝉c?ZQ?`*ߡQ} ՉmS ycis labeled by its prominent keywords, where prominence is measured by the number of documents in the cluster containing a particular keyword. I choose the top four most prominent keywords as to control the length of a cluster label. Regrettably, I'm most dissatisfied with this part of the project where I have to determine the optimal number of clusters without any reasonable justification. There has to be a better, more logically-sound way to compute this function, but unfortunately I ran out of time. 
An EM approach maybe able to group and label clusters simultaneously. Stage 5: Generate HTML output for the end-user view With the clustered groups, I'm finally ready to generate HTML code for the search results page. This is straightforward, but there are a few features worth mentioning. First, because PEG goes out of its way to download web documents pointed by the search results, it is able to show precisely which URLs are unavailable at the moment (e.g: 404, 403 errors). Secondly, the result page in68:<JKKKKK2L4LOOOPP$U&UZZ^$.024:<KKdKfKhKKKKKKK["\`anffuPu|}R~n~zzpD    &&'''' '"'N'P'R'T' j"OUj4@ UV jGUj@ UVOJQJ5 6OJQJ H*OJQJ0JjFOJQJUjOJQJUOJQJ jU j?Uj@ UV@'c3==L%UL%אOjۙiomZca<>ED:xy>̬q}.n88':3'݃xGy }L Aڇ0dN*q}([ӝr|)Ta*ױӎh{φR yFq͹Hg` dP(Եi|7>z5PKarWaii (Sammy) Sy 4893129  HYPERLINK mailto:sammysy@cs.stanford.edu sammysy@cs.stanford.edu June 9th, 2002 CS224n: Project Enhanced Google Problem to be investigated Most Internet search engines, given a search query, try to find web documents containing the terms listed in the query. However, because the search process treats words purely as semantic-less identifiers, these same search engines are unable to distinguish different word senses, not mentioning documents about different topical domains. Frequently, users must manually scan through a long list of interleaved results in order to zoom in onto the desired ones. A solution using an NLP approach One way to solve the problem is to figure out which one of the different word senses that a document assumes. For instance, let's say the term "Michael Jordan" carries two senses, the basketball player and the Berkeley professor. A document containing "Michael Jordan" will use the term to denote either the basketball player or the professor, but probably not both. This is an important property because we can rely on the set of word senses to categorize documents, and thereby solving the above problem. Our goal now has been reduced to defining such a function that maps the set of documents containing W to the set of W's word senses. Like all other NLP problems, however, the gut of this function is completely non-trivial. For example, given a search query, how do we obtain the set of word senses? One possible solution is to use a dictionary like WordNet, but this approach falls apart quickly as soon as we have to deal with non-standard English terminologies, acronyms, compound nouns, etc. cludes the list of keywords associated with each document, which acts as a decent summary of the content ahead. Results and performance The problem at hand is too complex for a machine to automatically decide whether a particular cluster configuration pppp%%d%f%h%%%%%%%5"6:;n@@OPOVWRXnXZ\_z`klzpD  vx½j@ UV jUj4@ UV jwUj@ UVOJQJ5 6OJQJ H*OJQJ0JjOJQJUjOJQJUOJQJ jUBppp$%%%%%2&4&)))**$/&/4488\8809z9*:t:v: & F$'c3==L%UL%אOjۙiomZca<>ED:xy>̬q}.n88':3'݃xGy }L Aڇ0dN*q}([ӝr|)Ta*ױӎh{φR yFq͹Hg` dP(Եi|7>z5PKarWaii (Sammy) Sy 4893129  HYPERLINK mailto:sammysy@cs.stanford.edu sammysy@cs.stanford.edu June 9th, 2002 CS224n: Project Enhanced Google Problem to be investigated Most Internet search engines, given a search query, try to find web documents containing the terms listed in the query. However, because the search process treats words purely as semantic-less identifiers, these same search engines are unable to distinguish different word senses, not mentioning documents about different topical domains. Frequently, users must manually scan through a long list of interleaved results in order to zoom in onto the desired ones. 
A solution using an NLP approach One way to solve the problem is to figure out which one of the different word senses that a document assumes. For instance, let's say the term "Michael Jordan" carries two senses, the basketball player and the Berkeley professor. A document containing "Michael Jordan" will use the term to denote either the basketball player or the professor, but probably not both. This is an important property because we can rely on the set of word senses to categorize documents, and thereby solving the above problem. Our goal now has been reduced to defining such a function that maps the set of documents containing W to the set of W's word senses. Like all other NLP problems, however, the gut of this function is completely non-trivial. For example, given a search query, how do we obtain the set of word senses? One possible solution is to use a dictionary like WordNet, but this approach falls apart quickly as soon as we have to deal with non-standard English terminologies, acronyms, compound nouns, etc. The space of all human-understandable terms is simply too huge. Moreover, word senses associated with a term can change over time, and thus maintaining a database of all sets of word senses would require too much work and human intervention. These problematic issues will be addressed more thoroughly in later sections of this report. My implemented solution involves inserting an extra processing layer between the users and a selected search engine, Google. This layer, which shall be referred to as Project Enhanced Google or PEG for short, will try to categorize search results returned by Google and present the categorized results neatly to the user. PEG's success will be judged by how well it catagorizes web documents, and how much convenience it brings to its users. PEG has a five-stage pipeline: Invoke the parent search engine to obtain a list of "raw" search results. Fetch and purify web documents. Extract features from web documents. Based on the extracted features, cluster the documents into groups by their similarity. Generate HTML for the end-user view. Detailed discussion of algorithms/method used Stage 1: Obtain raw search results I picked Google as the parent search engine because I like it myself, and because it allows search requests via a SOAP interface. By using the web service directly, I save the need to parse the usual HTML output manually while at the same time maintain future compatibility with this well defined interface. The plumbing behind this stage is of little NLP interest, and so I will reframe from elaborating further. Do keep in mind though that Google limits each licensee of the web service to 1000 requests per day. The quota is plentiful for obtaining search results, but I can't utilize Google's vast database to, for example, help examine keywords on a word-by-word basis. Stage 2: Fetch and purify web documents The Network API in Java is one of its strong points, so making HTTP requests to fetch web documents is straightforward. Originally, I was fetching documents sequentially (synchronous calls), but as soon as I started downloading 30 or 40 in one go, the total time spent just in this stage would exceed many minutes due to long socket timeouts and slow servers. To establash a soft upperbound in time requirement, I had to multithread this portion of the program to download multiple documents simultaneously, and upgrade to JDK 1.4 which provided an easy way to shorten socket timeouts. 
Let count(w|language) = { occurrences of word w in the corpus }
Let count(language) = { total number of tokens in the corpus }

prob(w|language) =
    if count(w|language) == 0, return 0.5 / count(language)
    otherwise, return count(w|language) / count(language)

Note that with this simple smoothing, prob(w|language) is no longer a valid probability mass function, but that is inconsequential because only the proportion matters in my calculation. Because count(language) is such a huge number, 0.5 is a decent guess for the number of occurrences of an unknown word.
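As a rough illustration (this is not the actual PEG code; the class and method names are hypothetical, and it is written against a modern JDK for brevity), the smoothed "normal English" model could be implemented along these lines:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of the smoothed unigram "normal English" model.
    public class UnigramLanguageModel {
        private final Map<String, Long> counts = new HashMap<>();
        private long totalTokens = 0;

        // Called once per token while digesting the reference corpus.
        public void addToken(String word) {
            counts.merge(word, 1L, Long::sum);
            totalTokens++;
        }

        // prob(w|language) with the 0.5 pseudo-count for unseen words.
        public double probability(String word) {
            long c = counts.getOrDefault(word, 0L);
            if (c == 0) {
                return 0.5 / totalTokens;
            }
            return (double) c / totalTokens;
        }
    }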
For each w in doc
    Let ratio = prob(w|doc) / prob(w|language)
    Let count = count(w|doc)
    w is a keyword of doc if ratio >= THRESHOLD_RATIO and count >= THRESHOLD_COUNT

In English, the algorithm says that a word is a keyword if two constraints are satisfied: the word appears abnormally frequently, and the word is repeated at least several times in the document. The constraints make intuitive sense and reflect the definition of keywords. The tricky part is deciding what values to assign to the two thresholds. As alluded to in the introduction, I don't have access to a training set (general documents with labeled keywords), so it's impossible to train the threshold values; I could only use my own judgement. This lack of a training set would continue to haunt me from many other angles. Perhaps I should have placed more emphasis on locating a usable training set from the get-go. It appears that THRESHOLD_RATIO=80 and THRESHOLD_COUNT=4 work decently well. Again, these values were picked without any theoretical backing, but most of the time the keyword selection is surprisingly accurate. Rarely does the algorithm fail disastrously.

The decision of whether a word is a keyword is not entirely binary. Some keywords are naturally more relevant than others in a document. For example, all else being equal, a word that appears twice as frequently as another is more likely to carry more relevance. Similarly, a word with a higher ratio should be considered more relevant. I use a simple scheme to compute this "importance" value:

    importance(keyword) = ratio(keyword) * count(keyword)

Because the ratio is a quotient, it can grow large enough to easily overwhelm the count value. Therefore, an unknown word may end up grabbing a huge share of importance, which arguably is the correct course of action. By not using raw text as features, I avoid potential problems when comparing two documents with highly skewed lengths. As long as the number of keywords per document is controlled, comparisons among documents will not run into unnecessary biases towards or against long or short documents. As a protective measure, I keep only the top 10 (MAX_KEYWORDS) most important keywords, setting a hard ceiling on the size of the feature set.

So far I have considered only unigram keywords, but there is a big problem: a keyword often comes packaged as a bigram, trigram, or an n-gram in general. For example, names of people are bigrams or trigrams. Also, for many compound terms, the "worth" of the whole is greater than the sum of its individual constituents; "operating system" means something special when the two words are used together, for instance. Indeed, the problem of finding collocated keywords is real and crucial. How should I extend the algorithm? The first question I asked was whether it is feasible to include more information in my language model. I thought about building a bigram model, which would be huge in size due to the sheer number of unique bigrams expected to be found. If the space of unigrams is barely manageable, then the space of bigrams is certainly unthinkable. The problem of data sparseness with a bigram model would simply balloon to the point of being unusable. In fact, the number of false positives for bigram keywords is too great to justify the (huge!) extra space needed for the bigram model. Not only that, a bigram model still doesn't scale to n-gram keywords in general.
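Before moving on to n-grams, here is a minimal sketch of the unigram keyword test and importance score described above. It is only illustrative: the names are hypothetical, and UnigramLanguageModel refers to the hypothetical class from the earlier sketch.

    import java.util.*;

    // Hypothetical sketch of unigram keyword selection, following the
    // ratio/count test and the importance score described above.
    public class KeywordExtractor {
        static final double THRESHOLD_RATIO = 80.0;
        static final int THRESHOLD_COUNT = 4;
        static final int MAX_KEYWORDS = 10;

        // docTokens: the purified token sequence of one web document.
        public static List<String> extractKeywords(List<String> docTokens,
                                                   UnigramLanguageModel lm) {
            Map<String, Integer> docCounts = new HashMap<>();
            for (String w : docTokens) {
                docCounts.merge(w, 1, Integer::sum);
            }
            int docLen = docTokens.size();

            // importance(w) = ratio(w) * count(w) for words passing both thresholds.
            Map<String, Double> importance = new HashMap<>();
            for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
                int count = e.getValue();
                double probInDoc = (double) count / docLen;
                double ratio = probInDoc / lm.probability(e.getKey());
                if (ratio >= THRESHOLD_RATIO && count >= THRESHOLD_COUNT) {
                    importance.put(e.getKey(), ratio * count);
                }
            }

            // Keep only the top MAX_KEYWORDS most important keywords.
            List<String> keywords = new ArrayList<>(importance.keySet());
            keywords.sort((a, b) -> Double.compare(importance.get(b), importance.get(a)));
            return keywords.size() > MAX_KEYWORDS
                    ? keywords.subList(0, MAX_KEYWORDS)
                    : keywords;
        }
    }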
Then I started thinking inductively. If I make one plausible assumption, I can collect collocated keywords relatively straightforwardly: an n-gram is a keyword only if all of its substrings of length n-1 are also keywords. For example, if "operating" and "system" have both established themselves as keywords, then "operating system" and "system operating" are both candidates to become bigram keywords. I enforce an additional constraint that the n-gram must also have appeared at least a certain number of times to qualify. (This algorithm was inspired by the A-Priori algorithm for computing frequent item sets.)

Recursive algorithm: an n-gram w is a keyword if:
    If n == 1, return true if w is a unigram keyword; return false otherwise.
    Otherwise,
        Let left = w[0, len-2]   // the (n-1)-gram prefix
        Let right = w[1, len-1]  // the (n-1)-gram suffix
        Let count = count(w|doc)
        Return true if both left and right are keywords and count >= THRESHOLD_COUNT;
        return false otherwise.

Note that I do not account for the ratio factor that I used in judging unigram keywords. The reason is that prob(n-gram|language) is tough to define with just a unigram-frequency language model. The Naive Bayes assumption, which worked admirably in the word sense disambiguation project, was a complete failure here: given the immense size of the corpus, prob(unigram|language) was already tiny, so applying the Naive Bayes assumption, a series of successive multiplications, led to microscopic prob(n-gram|language), which in turn led to exceedingly large and useless ratio values.

By assuming that an n-gram is a keyword only if all its substrings are also keywords, I am oversimplifying the space of n-gram keywords. A serious shortcoming is false negatives on n-gram keywords that contain a common word. For instance, "My Little Lover", a popular band in Japan, will never be collected as a trigram keyword because "my" is a common/stop word and hence not a keyword by itself. Other than that, many quality compound keywords are successfully assembled from the list of unigram keywords. The last part of the keyword extraction phase is removing duplicated (n-1)-gram keywords that have been used to construct n-gram keywords. This step avoids having "natural language" and "language processing" as bigram keywords whenever "natural language processing" is a keyword. The removal algorithm is also inductive in nature (and somewhat complex), but a nice property it has is that if "linux operating system" is a keyword, it is still possible for "linux" to be a keyword by itself, provided it has appeared on its own often enough.

Although I didn't have enough time to experiment with other keyword extraction algorithms, I thought about using other cues to improve results. First, it would be nice if each word in the text had been tagged with its part of speech. A general observation is that keywords are most often nouns or compound nouns; conversely, it is rare to find adverbs or adjectives as keywords. The presence of part-of-speech tags would allow an additional constraint, and a machine-accessible dictionary with part-of-speech listings could be used to approximate this "keyword is a noun" constraint as well. Secondly, collocated keywords may be "verified" by consulting Google's search database: if an n-gram is found to appear in a large number of web documents, then it is likely a valid keyword. However, the limited number of Google web service requests disallowed this trick.
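A minimal sketch of the recursive n-gram keyword test above, under the same caveats as the earlier sketches: the names are hypothetical, and unigramKeywords and ngramCounts stand in for the per-document statistics PEG would already have computed.

    import java.util.*;

    // Hypothetical sketch of the A-Priori-style n-gram keyword test.
    public class NgramKeywords {
        static final int THRESHOLD_COUNT = 4;

        // words: the n-gram as a token list, e.g. ["natural", "language", "processing"]
        public static boolean isKeyword(List<String> words,
                                        Set<String> unigramKeywords,
                                        Map<List<String>, Integer> ngramCounts) {
            if (words.size() == 1) {
                return unigramKeywords.contains(words.get(0));
            }
            List<String> left = words.subList(0, words.size() - 1);   // drop last token
            List<String> right = words.subList(1, words.size());      // drop first token
            int count = ngramCounts.getOrDefault(words, 0);
            return isKeyword(left, unigramKeywords, ngramCounts)
                    && isKeyword(right, unigramKeywords, ngramCounts)
                    && count >= THRESHOLD_COUNT;
        }
    }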
On well-behaving documents, ones with plenty of well-formed English sentences, the result of keyword extraction is spectacular (more supporting data to follow). However, there are pages out on the web that are entirely Flash-based, image-based, or frame-based. In those cases no keywords can be found, and such a document becomes featureless. This is a huge problem, as PEG won't be able to do anything about these misbehaving webpages at all.

Stage 4: Automatically cluster web documents by keywords

Given keywords as document features, we can prepare the space of documents for unsupervised clustering. Our goal is two-fold: group similar documents together, and find well-separated clusters. Each point in this document space is completely and uniquely identified by its set of keywords. This is a non-Euclidean space without the usual notion of n-dimensional coordinates, but a distance (similarity) function between two documents is still definable. There are three types of distance functions that I must define: keyword-to-keyword, document-to-document, and cluster-to-cluster.

For the first type, a straightforward definition is the string equality test: two keywords are equal if and only if the strings match completely. At first glance this definition seems logical, but unfortunately it doesn't handle n-gram keywords fairly. For instance, "linux" and "linux kernel", which are semantically similar, will be unfairly marked as dissimilar. To deal with n-gram keywords, I must allow fractional/partial similarity. Again, I do not know the theoretically optimal distance function in this case, but here's a logical option:

keywordDistance(A, B)
    Let a be a substring of A and b a substring of B such that a == b and len(a) = len(b) is maximized
    Let intersect = len(a)
    Let union = len(A) + len(B) - intersect
    return intersect / union

This definition resembles the Jaccard measure. Some examples: distance("a b", "a") = 0.5, distance("a b c", "b c d") = 0.5, distance("a b c", "a c") = 0.25.

Other, more indirect metrics could be used to compare keywords. For example, I could use WordNet's synonym listing to approximate the conceptual distance between terms representing similar ideas, like how "ship" is close in meaning to "boat". However, WordNet's database was most of the time inadequate for handling non-standard terms, like "Oracle" as a software company, or n-gram keywords.

Normally we could use the Jaccard or cosine measure to compute the document-to-document distance. However, with fractional keyword equality, I have to abandon the Jaccard measure because the set of keywords is no longer a set of distinct objects.

documentDistance(A, B)
    total = 0
    For each <i, j> where 0 <= i < len(A), 0 <= j < len(B)
        total = total + keywordDistance(A(i), B(j))
    return total / (len(A) * len(B))

Essentially, this computes the average keyword-to-keyword distance over all pairs of keywords in the two sets. It has the property that documentDistance(A,B) == documentDistance(B,A).
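For concreteness, here is a sketch of the two functions above, with the same hedging as before: the representation (keywords as token arrays, documents as lists of keywords) and the class name are my own assumptions, not PEG's actual code.

    import java.util.List;

    // Hypothetical sketch of the keyword-to-keyword and document-to-document
    // similarity functions described above.
    public class Distances {

        // Jaccard-like similarity between two n-gram keywords.
        public static double keywordDistance(String[] a, String[] b) {
            int longest = 0;
            // Find the longest common contiguous run of tokens.
            for (int i = 0; i < a.length; i++) {
                for (int j = 0; j < b.length; j++) {
                    int k = 0;
                    while (i + k < a.length && j + k < b.length
                            && a[i + k].equals(b[j + k])) {
                        k++;
                    }
                    longest = Math.max(longest, k);
                }
            }
            int union = a.length + b.length - longest;
            return (double) longest / union;
        }

        // Average pairwise keyword similarity between two documents' keyword sets.
        public static double documentDistance(List<String[]> docA, List<String[]> docB) {
            double total = 0;
            for (String[] ka : docA) {
                for (String[] kb : docB) {
                    total += keywordDistance(ka, kb);
                }
            }
            return total / (docA.size() * docB.size());
        }
    }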
The effect of the cluster-to-cluster distance function is more subtle. I've experimented with two definitions. In the first, I defined the inter-cluster distance to be the distance between the two closest points of the two clusters. The advantage of this definition is that it allows for more "irregularly-shaped" clusters, but it's also less stable, leading to ill-separated clusters. The second definition is based on the average document-to-document distance over all pairs of documents in the two clusters. The advantage of this definition is robustness against linking two well-separated large clusters, but the resulting clusters tend to be "spherical" (i.e., they like to expand uniformly in all directions). Fortunately, due to the relatively few data points (30-50 documents), both definitions perform about identically.

I employ the straightforward hierarchical clustering algorithm to group the documents, but exactly how do you determine the optimal number of clusters? We can compute the quality of a cluster by its "compactness", approximated by its average pair-wise document-to-document distance (think: clique). It's a measure of how well-separated clusters are from one another. If we define compact(n) to be the micro-averaged cluster compactness when there are n clusters (as chosen by the hierarchical clustering algorithm), then compact(n) is an increasing function of n (it decreases as n approaches 1): when n equals the number of documents, compact(n) = 1.0, but as soon as we start grouping different documents together, the overall compactness decreases, approaching 0.0. Naturally, we want clusters to be as tightly packed as possible, but it can also be argued that we prefer as few groups as necessary. My hope is that these two competing qualities will single out an optimal number of clusters. I do not know for certain how to model "as few groups as necessary", and without a training set it's tough to guess thresholds. In the end, after taking PEG for a few manual test drives, I decided to use a normal distribution with a mean of 6.0 and a variance of 5.0 (in units of clusters). These values are arbitrary, but nevertheless not completely random.

Afterwards, each cluster is labeled by its prominent keywords, where prominence is measured by the number of documents in the cluster containing a particular keyword. I choose the top four most prominent keywords so as to control the length of a cluster label. Regrettably, I'm most dissatisfied with this part of the project, where I have to determine the optimal number of clusters without any reasonable justification. There has to be a better, more logically sound way to compute this, but unfortunately I ran out of time. An EM approach may be able to group and label clusters simultaneously.
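To make the cluster-count selection concrete, here is a hypothetical sketch. The report names the two ingredients (micro-averaged compactness and a normal distribution over the number of clusters) but does not spell out how they are combined, so the simple product used below is my assumption, not PEG's actual rule.

    // Hypothetical sketch: score each candidate cut of the hierarchical
    // clustering by compactness weighted with a Gaussian prior on the
    // cluster count (mean 6.0, variance 5.0).
    public class ClusterCountSelector {
        static final double MEAN = 6.0;
        static final double VARIANCE = 5.0;

        // compactness[n] = micro-averaged cluster compactness when the
        // dendrogram is cut into n clusters (valid indices 1..maxClusters).
        public static int pickClusterCount(double[] compactness, int maxClusters) {
            int best = 1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int n = 1; n <= maxClusters; n++) {
                double prior = Math.exp(-(n - MEAN) * (n - MEAN) / (2 * VARIANCE));
                double score = compactness[n] * prior;  // assumed combination
                if (score > bestScore) {
                    bestScore = score;
                    best = n;
                }
            }
            return best;
        }
    }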
Stage 5: Generate HTML output for the end-user view

With the clustered groups, I'm finally ready to generate the HTML for the search results page. This is straightforward, but there are a few features worth mentioning. First, because PEG goes out of its way to download the web documents pointed to by the search results, it is able to show precisely which URLs are unavailable at the moment (e.g., 404 or 403 errors). Secondly, the result page includes the list of keywords associated with each document, which acts as a decent summary of the content ahead.

Results and performance

The problem at hand is too complex for a machine to automatically decide whether a particular cluster configuration is "good". Without a usable training set, tuning the system and experimenting with new ideas quickly became an exhausting task of testing on a few hopefully representative search queries and eyeballing the results manually. Certainly, this is not the best approach to judging an NLP system. We start with the performance of the keyword extraction engine.

With the default parameters, I computed the sets of keywords, ordered by importance, for the following sites (as of today):

threshold=80, count=4
http://www.berkeley.edu/: uc, berkeley, campus, sites, student, students
http://www.stanford.edu/: stanford
http://www.mit.edu/: mit
http://www.cnn.com/: u.s, cnn.com, cnn, video
http://www.stanford.edu/class/cs224n/: java, natural language processing, language processing, processing, assignments, assignment, gates, section, programming, please
http://nlp.stanford.edu/~manning/: +1 650, nlp, probabilistic, ling, linguistics, grammar, natural language processing, christopher manning, language processing, manning

If I reduce the ratio threshold to 40, I obtain the following additional keywords:

threshold=40, count=4
http://www.berkeley.edu/: (none)
http://www.stanford.edu/: (none)
http://www.mit.edu/: news
http://www.cnn.com/: sports
http://www.stanford.edu/class/cs224n/: final project
http://nlp.stanford.edu/~manning/: (none)

As you can see, there is not a whole lot of difference despite a significant reduction in the ratio threshold. This is good news because it means good keywords tend to be good at standing out from the norm. Let's reduce the count threshold, while returning the ratio threshold to its original value:

threshold=80, count=3
http://www.berkeley.edu/: uc, berkeley, cal, campus, extension, sites, sessions, student, students, june
http://www.stanford.edu/: alumni, stanford university, stanford
http://www.mit.edu/: mit, events, programs
http://www.cnn.com/: u.s, cnn.com, instyle.com, sci-tech, jagger, cnn, video, die, gas
http://www.stanford.edu/class/cs224n/: ling, java, natural language processing, language processing, processing, directory, assignments, assignment, manning, gates
http://nlp.stanford.edu/~manning/: +1 650, nlp, 2002, syntactic, probabilistic, ling, science linguistics, linguistics, computational, grammar

threshold=80, count=2
http://www.berkeley.edu/: 2002, a-z, sessions uc extension, finder map, uc, undergraduate, berkeley, cal student, web sites, libraries
http://www.stanford.edu/: directories, alumni, stanford university, stanford
http://www.mit.edu/: mit web, search mit, mit, directory, events calendar, faculty staff, tech, campus, offices programs, massachusetts
http://www.cnn.com/: u.s, cnn.com, instyle.com, sci-tech, time.com, 2002, jagger knighted, mick jagger, in-depth, swamp nets
http://www.stanford.edu/class/cs224n/: cs 224n ling 237 natural language processing, eval.pl script, e.g., hyponym, wmhome, jwordnet, stdin", ling, java, computational
http://nlp.stanford.edu/~manning/: +1 650, nlp, 2002, syntactic, probabilistic models, probabilistic, hinrich sch tze, 2001, 2000, 1999

A lot more unimportant keywords crop up when the count threshold is set too low. Let's increase the count threshold instead:

threshold=80, count=5
http://www.berkeley.edu/: student, students
http://www.stanford.edu/: stanford
http://www.mit.edu/: mit
http://www.cnn.com/: cnn, video
http://www.stanford.edu/class/cs224n/: language processing, processing, section, natural
http://nlp.stanford.edu/~manning/: probabilistic, linguistics, natural language processing, christopher manning, language processing, manning, processing, text, language, natural

It seems a count threshold of 5 (or higher) is too restrictive; too many otherwise good and valid keywords are discarded.

A systematic method of evaluating clustering results is not immediately obvious, but I can demonstrate the system's effectiveness with a few examples.
Let's try "Berkeley": query=Berkeley cluster namecluster sizecorrect countcorrect percentage1students22100%2seti@home, june 2002, 2002, sponsors11100%3library, libraries33100%4library, uc, collections, web11100%5berkeley, information8787.5%6seismological laboratory, earthquake information, earthquake, berkeley11100%7web22100%8{unrelated/lonely}12 Of course, it easy to have high correct percentage if cluster size is small and the cluster group general enough. The documents in the {unrelated/lonely} cluster are either featureless, or unrelated to all other documents in the pool. Unfortunately, although the quality of individual clusters is decent, we would have preferred a fewer clusters. For example, the "students" and "berkeley, information" should form a larger cluster. Also, the number of unsorted documents fallen into the {unrelated} cluster is substantial. I mentioned that earlier that I evaluate cluster "compactness" by computing its average internal distance in a clique of documents configuration, how the compactness function is a decreasing function as the number of clusters approaches one. Here's an illustration:  Note that this function doesn't exhibit a "cliff", or a suddent drop, as n approaches 0. Let's search for "doom": query=doom cluster namecluster sizecorrect countcorrect percentage1doom, doom iii, 2002, iii161487.5%2science fiction fantasy horror, planet doom, sites, science fiction11100%3sitebad category4sgi doom, sgi version, version sgixdoom11100%532+32kwb asus tx97-e 430tx.... 2000, pv, pci11100%6armdeu 1.17, armdeu, wadptr 2.3, wadptr11100%7linux, 10%, 2.5%, freebsd11100%8{unrelated/lonely}7 The cluster labeled "site" is a bad category because it doesn't mean anything. It would have been preferred if the "sgi doom" and "32+32kwb" (a benchmarking site) clusters are merged with the main "doom" (the idSoftware game) cluster. PEG was able to single out a cluster for a horror story, and an essay about linux being Microsoft's impending doom.   Related work in the literature There have been several text clustering systems developed in the past. Xerox PARC's Scatter/Gather implemented a similar system, but it didn't try to distill document features (i.e: keywords), and it used a straightforward cosine measure to define inter-document distances. Andrei Broder of DEC had also done some work in defining a inter-document distance function specifically targeted for web document. It used the mathematical notion of "resemblance" and "containment" (symmetric vs asymmetric relationship), the latter of which PEG does not attempt to account for. Instead of using keywords, it used shingles, which are essentially n-grams.  EMBED Excel.Sheet.8   EMBED Excel.Sheet.8   EMBED Excel.Sheet.8  v:x:::;j@l@n@@VEXEjFlFNNOOPOTTVVW|WWWZZ$[$[[[B\\\__6____B`z`|`TaabHfJfNhPhkkllmm & F hmqq uuzz,zĀNDz|8:npplnp"\8^`D|ޤ \̦Φ & F h & F hΦ2FH.0̿οNNP**\^d$$l $JLNv"$hjlnHlhp$$l $fhjl2 @\<4 $$l $FBt:PR dd$$l $Hhjlbd"HJN`dhrtT$$ljm $$$l aLsuz|Vksa"\* ٬HCt+8@i , s, E[, ;<#IRuUꃬc A&5YǖT6o~g+I]yKυvd3;Զmvv;:kwTZzUfB55TG0~|g1ې'c3==L%UL%אOjۙiomZca<>ED:xy>̬q}.n88':3'݃xGy }L Aڇ0dN*q}([ӝr|)Ta*ױӎh{φR yFq͹Hg` dP(Եi|7>z5PDyK sammysy@cs.stanford.eduyK >mailto:sammysy@cs.stanford.eduDd 0  # A2cw~Qk +2 `!cw~Qk +$#GJ*xZhU\5dṕ̠!2[mLcC(IfK$X, )mbdH-dVV{}}~y~ιxb@d~P Ы@cjB&IDi)ԟ'Q8OdYԞ|<is "good". Without a usable traininig set, tuning the system and experimenting with new ideas quickly became an exhaustive task of testing on a few hopefully representative test search queries and eyeballing results manually. 
2000, pv, pci11100%6armdeu 1.17, armdeu, wadptr 2.3, wadptr11100%7linux, 10%, 2.5%, freebsd11100%8{unrelated/lonely}7 The cluster labeled "site" is a bad category because it doesn't mean anything. It would have been preferred if the "sgi doom" and "32+32kwb" (a benchmarking site) clusters are merged with the main "doom" (the idSoftware game) cluster. PEG was able to single out a cluster for a horror story, and an essay about linux being Microsoft's impending doom.   Related work in the literature There have been several text clustering systems developed in the past. Xerox PARC's Scatter/Gather implemented a similar system, but it didn't try to distill document features (i.e: keywords), and it used a straightforward cosine measure to define inter-document distances. Andrei Broder of DEC had also done some work in defining a inter-document distance function specifically targeted for web document. It used the mathematical notion of "resemblance" and "containment" (symmetric vs asymmetric relationship), the latter of which PEG does not attempt to account for. Instead of using keywords, it used shingles, which are essentially n-grams. Conclusion Because PEG employs Google as the underlying search engine, technically, the search results cannot be any worse than raw Google.  EMBED Excel.Sheet.8   EMBED Excel.Sheet.8   EMBED Excel.Sheet.8  zK*LtLvLxLLLMjRlRnRRVWXWjXlX``aaPaffhhi|iii & Fill$mmmBnnnqq6qqqqBrzr|rTsstHxJxNzPz}}~ & F h~~ ̑,zĒNDz|8:p:nplnp"\8^`D|޶ \ & F h & F h̸θ2FH.0o~g+I]yKυvd3;Զmvv;:kwTZzUfB55TG0~|g1ې'c3==L%UL%אOjۙiomZca<>ED:xy>̬q}.n88':3'݃xGy }L Aڇ0dN*q}([ӝr|)Ta*ױӎh{φR yFq͹Hg` dP(Եi|7>z5PDyK sammysy@cs.stanford.eduyK >mailto:sammysy@cs.stanford.eduDd 0  # A2cw~Qk +2 `!cw~Qk +$#GJ*xZhU\5dṕ̠!2[mLcC(IfK$X, )mbdH-dVV{}}~y~ιxb@d~P Ы@cjB&IDi)ԟ'Q8OdYԞ|<a=Sw(͡\ڌfn}/!yt* I2SQrRq/%xIl [Ln_@x4:^SI0, 1999 A lot more unimportant keywords seem to have cropped up when the threshold is set too low. Let's increase the count threshold: threshold=80, count=5 http://www.berkeley.edu/student, studentshttp://www.stanford.edu/stanfordhttp://www.mit.edu/mithttp://www.cnn.com/cnn, videohttp://www.stanford.edu/class/cs224n/language processing, processing, section, naturalhttp://nlp.stanford.edu/~manning/probabilistic, linguistics, natural language processing, christopher manning, lanugage processing, manning, processing, text, language, natural It seems a count threshold of 5 (or higher) is too restrictive. Too many otherwise good and valid keywords are discarded. A systematic method of evaluating clustering results is not immediately obviousable training set from the get go. It appears that THRESHOLD_RATIO=80 and THRESHOLD_COUNT=4 work decently well. Again, these values were picked without any theoretical backing, but most of the time the keyword selection is surprisingly accurate. Rarely does the algorithm fail disastrously. The decision of whether a word is a keyword is not entirely binary. Some keywords are naturally more relevant than others in a document. For example, given all else equal, a word that appears twice as frequently as the other word is obviously more likely to carry more relevance. Similarly, a word with a higher ratio should be considered more relevant. I use a simple scheme in computing this "importance" value. importance(keyword) = ratio(keyword) * count(keyword) Because the ratio equation is divisional, it can easily overwhelm the count value. Therefore, an unknown word may end up grabbing a huge share of importance, which arguably is the correct course of action. 
By not using raw text as features, I avoid potential problems comparing two documents with highly skewed lengths. As long as the number of keywords for each document is controlled, comparison among documents will not run into unnecessary biases towards or against long or short documents. As a protective measure, I keep only the top 10 (MAX_KEYWORDS) most important keywords to set a hard ceiling on the size of the feature set. So far, I have considered locating keywords which are unigrams, but there is a big problem: a keyword often comes packaged as a bigram, trigram, or an n-gram in general. For example, among many other things, names of people are bigrams or trigrams. Also, for many compound terms, the "worth" of the whole is greater than the sum of its individual constituents. "Operating System" means something special when used together, for instance. Indeed, the problem of finding collocated keywords is real and crucial. How should I extend the algorithm? The first question I asked was whether it is feasible to include more information with my language model. I thought about building a bigram model, which would be huge in size due to sheer number of unique bigrams expected to be found. If the space of unigram is barely managable, then the space of bigrams is certainly unthinkable. The problem of data sparseness with a bigram model will simply balloon to the point of being unusable. In fact, the number of false positives for bigram keywords is too great to justify the extra space (HUGE!) needed for the bigram model. Not only that, a bigram model still doesn't scale well to deal with n-gram keywords regardless. Then I started thinking inductively. If I make one plausible assumption, I can collect collocated keywords relatively straightforwardly: an n-gram is a keyword only if all its substrings of length (n-1) are also keywords. For example, if "operating" and "system" have both established themselves as keywords, then "operating system" and "system operating" are both candidates to become bigram keywords. I will enforce an additional constraint where the n-gram keyword must also have appeared at least a certain number of times to qualify. By the way, this algorithm was inspired by the A-Priori algorithm for computing frequent item sets. Recursive algorithm: An n-gram w is a keyword if: If (n==1) then, return true if w is a unigram keyword; return false otherwise. Otherwise, Let left = w[0,len-2] // substring Let right = w[1,len-1] Let count = count(w|doc) return true if both left and right are keywords and count >= THRESHOLD_COUNT return false otherwise. Note that I'm not accounting for the ratio factor which I used in judging unigram keywords. The reason is that prob(n-gram|language) is a tough one to define with just a unigram frequency language model. The Naive Bayes Assumption, which worked admirably in the word sense disambiguation project, was a complete failure here. Given the immense size of the corpus, prob(unigram|language) was already tiny. Applying the Naive Bayes Assumption, a series of successive multiplications, led to microscopic prob(n-gram|language), which in turn led to exceeding large and useless ratio values. By making the assumption that an n-gram is a keyword only if all its substrings are also keywords, I'm oversimplyis. But I can demonstrate its effectiveness with a few examples. 
Let's try "Berkeley": query=Berkeley cluster namecluster sizecorrect countcorrect percentage1students22100%2seti@home, june 2002, 2002, sponsors11100%3library, libraries33100%4library, uc, collections, web11100%5berkeley, information8787.5%6seismological laboratory, earthquake information, earthquake, berkeley11100%7web22100%8{unrelated/lonely}12 Of course, it easy to have high correct pe?6aƮ؝ 2юAR+x}Ss|;"k /Wei٧|XK;8TA־,VӹڧُvG3L+9!RyfA^{ט )^጗R>-]O>nAN 10~v/0VɩťTږxbUx}|}1BЯ֢bij89㪁I{L_SuFb~?婩-4mgፌ?mpMlEN#sQyCoxd(kϴIgV~~fO|`|O@mkj'5LNeTZXy:O]Q13= }s*5Q#]Rp~oa艷=څ=µOϾwᤰև;pXrs_HX+l+N bq where 0 <= i < len(A), 0 <= j < len(B) total = total + keywordDistance(A(i), B(j)) return total / (len(A) * len(B)) Essentially, it's computing the average word-to-word distance for each pairs of keywords in the two sets. It has the property that documentDistance(A,B) == documentDistance(B,A). The effect of the Cluster-to-Cluster distance function is more subtle. I've experimented with two definitions. In the first one, I defined the inter-cluster instancehis function doesn't exhibit a "cliff", or a suddent drop, as n approaches 0. Let's search for "doom": query=doom cluster namecluster sizecorrect countcorrect percentage1doom, doom iii, 2002, iii161487.5%2science fiction fantasy horror, planet doom, sites, science fiction11100%3sitebad category4sgi doom, sgi version, version sgixdoom11100%532+32kwb asus tx97-e 430tx.... 2000, pv, pci11100%6armdeu 1.17, armdeu, wadptr 2.3, wadptr11100%7linux, 10%, 2.5%, freebsd12|$C9V/DXB: `!P$C9V/D (#GJ*xZ hU~wS R0C  ,6J\[Wm)6c(I0\*(EjR2)I (ZI#B? o=9wg}sy{''٘IY^XWz'yB.Bқ"K"8x~&2$NVKvEww7ơ}. 7#F4NL_ d[2O#3]3GdHņFc? [S|[ckSU܍yy|Mּ>IKvl>LQYq\1Zif$S(eY& 3-hxUV*-iW{f\+~ˉا(tg3 P[JW]攩T֨|I-Y MъAN_\٪4ە:`J+a1~?riMϞu)yNog<FFCreaEsQRc`;Xc50X c%2VȘ_^30ûdRja&ƶ3g(cG;.0v]e6c=B5f U4mZу6[(*ZOlv{AP; UTua){& UT |bƧʞH6Up_ଲGREK.,R;M tϐ+*TݑoHm A𗲻RE ؆;춀TǦj 6LQ5 UDz))AVPvK@hgSH4b W1 UϴZ& F# HtmVxE/+> U׶*QD1 eݾMukQT'&Qw1V-e\XOp<ᵛ6vgUr~f%0Ey[H>ZyՕyt7 ; cډu5,P̕llu57ouw1 E x5Okkqn1hf>D-j ӺhYTt7+#Fřr&gaq8SsZYbqyƹřjj _j(L vqP Pݺ78I6Ŋhmy._/#$_3pJCoِ12aO sU>rd.s|.kh7(.Dd 0  # A1100%8{unrelated/lonely}7 The cluster labeled "site" is a bad category because it doesn't mean anything. It would have been preferred if the "sgi doom" and "32+32kwb" (a benchmarking site) clusters are merged with the main "doom" (the idSoftware game) cluster. PEG was able to single out a cluster for a horror story, and an essay about linux being Microsoft's impending doom.   Related work in the literature There have been several text clustering systems developed in the past. Xerox PARC's Scatter/Gather implemented a similar system, but it didn't try to distill document features (i.e: keywords), and it used a straightforward cosine measure to define inter-document distances. Andrei Broder of DEC had also done some work in defining a inter- to be the distance of the two closest points between the two clusters. The advantage of this definition is that it allows for more "irregularly-shaped" clusters, but it's also less stable, leading to ill-separated clusters. The second definition is based on the average document-to-document distance between each pair of documents in the two clusters. The advantage of this definiton is robustness against linking two well separated large clusters, but the shapes of the clusters tend to be "spherical" (i.e: likes to expand uniformly in all direction). Fortunately, due to the relatively few data points (30-50), both definitions perform about identically. 
I employ the straightforward Hierarchical Clustering algorithm to group the documents, but exactly how do you determine the optimal number of clusters? We can compute the quality of a cluster by its "compactness", approximated by its average pair-wise document-to-document distances (think: clique). It's a measure of how well-separated clusters are to one another. If we define compact(n) to be the micro-average cluster compactness for having n clusters (chosen by the hierachical clustering algorithm), then it can be proven that it's an increasing function (decreases as n approaches 1); when n == number of documents, compact(n) = 1.0, but as soon as we start grouping different documents together, the overall compactness decreases, approaching 0.0. Naturally, we want clusters to be as tightly packed as possible, but it can also be argued to we prefer as few necessary groups as possible. My hope is that these two competing qualities will single out an optimal number of clusters. I do not know for certain how to model "as few necessary groups as possible", and without a training set, it's tough to guess thresholds. In the end after taking PEG for a few manual test drives, I decided to use a normal distribution, with a mean of 6.0 and variance of 5.0 (in units of clusters). These are arbitrary, but nevertheless not completely random. Afterwards, each cluster is labeled by its prominent keywords, where prominence is measured by the number of documents in the cluster containing a particular keyword. I choose the top four most prominent keywords as to control the length of a cluster label. Regrettably, I'm most dissatisfied with this part of the project where I have to determine the optimal number of clusters without any reasonable justification. There has to be a better, more logically-sound way to compute this function, but unfortunately I ran out of time. An EM approach maybe able to group and label clusters simultaneously. Stage 5: Generate HTML output for the end-user view With the clustered groups, I'm finally ready to generate HTML code for the search results page. This is straightforward, but there are a few features worth mentioning. First, because PEG goes out of its way to download web documents pointed by the search results, it is able to show precisely which URLs are unavailable at the moment (e.g: 404, 403 errors). Secondly, the result page includes the list of keywords associated with each document, which acts as a decent summary of the content ahead. Results and performance The problem at hand is too complex for a machine to automatically decide whether a particular cluster configuration is "good". Without a usable traininig set, tuning the system and experimenting with new ideas quickly became an exhaustive task of testing on a few hopefully representative test search queries and eyeballing results manually. Certainly, this is not the best approach in judging an NLP system. We start with the performance of the keyword extraction engine. 
With the default parameters, I computed the sets of keywords, ordered by importance, for the following sites (as of today): threshold=80, count=4 http://www.berkeley.edu/uc, berkeley, campus, sites, student, studentshttp://www.stanford.edu/stanfordhttp://www.mit.edu/mithttp://www.cnn.com/u.s, cnn.com, cnn, videohttp://www.stanford.edu/class/cs224n/java, natural language processing, language processing, processing, assignments, assignment, gates, section, programming, pleasehttp://nlp.stanford.edu/~manning/+1 650, nlp, probabilistic, ling, linguistics, grammar, natural language processing, christopher manning, language processing, manning If I reduce the ratio threshold to 40, I will obtain the following additional keywords: threshold=40, count=4 http://www.berkeley.edu/http://www.stanford.edu/http://www.mit.edu/newshttp://www.cnn.com/sportshttp://www.stanford.edu/class/cs224n/final projecthttp://nlp.stanford.edu/~manning/ As you can see, there is not a whole lot of differences despite a significant reduction in the ratio threshold. This is good news because it means good keywords tend to be good at prevailing themselves from the norm. Let's reduce the count threshold, while returning the ratio threshold back to its original value: threshold=80, count=3 http://www.berkeley.edu/uc, berkeley, cal, campus, extension, sites, sessions, student, students, junehttp://www.stanford.edu/alumni, stanford university, stanfordhttp://www.mit.edu/mit, events, programshttp://www.cnn.com/u.s, cnn.com, instyle.com, sci-tech, jagger, cnn, video, die, gashttp://www.stanford.edu/class/cs224n/ling, java, natural language processing, language processing, processing, directory, assignments, assignment, manning, gateshttp://nlp.stanford.edu/~manning/+1 650, nlp, 2002, syntactic, probabilistic, ling, science linguistics, linguistics, computational, grammar threshold=80, count=2 document distance function specifically targeted for web document. It used the mathematical notion of "resemblance" and "containment" (symmetric vs asymmetric relationship), the latter of which PEG does not attempt to account for. Instead of using keywords, it used shingles, which are essentially n-grams. Conclusion Because PEG employs Google as the underlying search engine, technically, the search results cannot be any worse than that of raw Google. 
PEG attempts to extract more information from the 2FJO~?v`!~FJO~ #GJ*Lx͚]lTEs- }AӖ( IMu#//'RY$ׁ҆H?8}[Sm| Gs*904 ʎ۹z;b-xҏ߳RQq4tv3nU<'/[eU+9*mcK3lTؖ{uBME3%tD}|"*4alԵmNL%&Kyޔ~>؃it|Ε:)ʖaee0kaČr kՏ?8~ʥu8{>KurѪDM\SŐc/Z7(d%ׂab6lYYY fM̚q 0wm`ά^f{fvif]dv f72esbVǬjfmd.f+jV0z]xASe'm_'EXcc#uLuW3r1i̴\ ̴\ jfZs53-˜fEs5kyJ˛iN/kfZZLu^Lydv^u[[j͢{fv5f'U<&[X45FWT=6z1 oѭHH^Qڼei|~%/ Wq 'A#8䥐9l<._ 甼WؼeEY3:hyu[ pK!yE+ͫ1]1QHH^Zn*%qӕ<Wؼ'c 0W!yEͫ+i2%CvTWk6юWۼ"ZUƫhS5$º-Hc\?^M.Ǿfiί7VA ߵ?xTʆd ޕMw<ܚ6$wq}{{H7$h 㝉c?ZQ?`*ߡQ} ՉmS ycaLsuz|Vksa"\* ٬HCt+8@i , s, E[, ;<#IRuUꃬc A&5YǖT6o~g+I]yKυvd3;Զmvv;:kwTZzUfB55TG0~|g1ې'c3==L%UL%אOjۙihttp://www.berkeley.edu/2002, a-z, sessions uc extension, finder map, uc, undergraduate, berkeley, cal student, web sites, librarieshttp://www.stanford.edu/directories, alumni, stanford university, stanfordhttp://www.mit.edu/mit web, search mit, mitNP**\^ތd޸ޠި$$l $JLNv"$hjlnlhp$$l $Hfhjl2 @\<4 $$l $FBt: d$$l $:PRHhjlbd"HJN`dT$$ljm $$l $`dhrtxTX\fhl|ߨߌ$$ljm $@DHRTX`dhrtxL@l$$ljm $:`bf@DHRTXݤ@\$$ljm $Xb|~HLPZ\`ߘ$$ljm $:>@BDF      T V | ~ lnph$$ljm $, directory, events calendar, faculty staff, tech, campus, offices programs, massachusettshttp://www.cnn.com/u.s, cnn.com, instyle.com, sci-tech, time.com, 2002, jagger knighted, mick jagger, in-depth, swamp netshttp://www.stanford.edu/class/cs224n/cs 224n ling 237 natural language processing, eval.pl script, e.g., hyponym, wmhome, jwordnet, stdin", ling, java, computationalhttp://nlp.stanford.edu/~manning/+1 650, nlp, 2002, syntactic, probabilistic models, probabilistic, hinrich sch tze, 2001, 200/ =!"#$%`!cw~Qk +$#GJ*xZhU\5dṕ̠!2[mLcC(IfK$X, )mbdH-dVV{}}~y~ιxb@d~P Ы@cjB&IDi)ԟ'Q8OdYԞ|<a=Sw(͡\ڌfn}/!yt* I2SQrRq/%xIl [Ln_@x4:^SI?6aƮ؝ 2юAR+x}Ss|;"k /Wei٧|XK;8TA־,VӹڧُvG3L+9!RyfA^{ט )^጗R>-]O>nAN 10~v/0VɩťTږxbUx}|}1BЯ֢bij89㪁I{L_SuFb~?婩-4mgፌ?mpMlEN#sQyCoxd(kϴIgV~~fO|`|O@mkj'5LNeTZXy:O]Q13= }s*5Q#]Rp~oa艷=څ=µOϾwᤰև;pXrs_HX+l+N bqIKvl>LQYq\1Zif$S(eY& 3-hxUV*-iW{f\+~ˉا(tg3 P[JW]攩T֨|I-Y MъAN_\٪4ە:`J+a1~?riMϞu)yNog<FFCreaEsQRc`;Xc50X c%2VȘ_^30ûdRja&ƶ3g(cG;.0v]e6c=B5f U4mZу6[(*ZOlv{AP; UTua){& UT |bƧʞH6Up_ଲGREK.,R;M tϐ+*TݑoHm A𗲻RE ؆;춀TǦj 6LQ5 UDz))AVPvK@hgSH4b W1 UϴZ& F# HtmVxE/+> U׶*QD1 eݾMukQT'&Qw1V-e\XOp<ᵛ6vgUr~f%0Ey[H>ZyՕyt7 ; cډu5,P̕llu57ouw1 E x5Okkqn1hf>D-j ӺhYTt7+#Fřr&gaq8SsZYbqyƹřjj _j(L vqP Pݺ78I6Ŋhmy._/#$_3pJCoِ12aO sU>rd.s|.kh7(`!~FJO~ #GJ*Lx͚]lTEs- }AӖ( IMu#//'RY$ׁ҆H?8}[Sm| Gs*904 ʎ۹z;b-xҏ߳RQq4tv3nU<'/[eU+9*mcK3lTؖ{uBME3%tD}|"*4alԵmNL%&Kyޔ~>؃it|Ε:)ʖaee0kaČr kՏ?8~ʥu8{>KurѪDM\SŐc/Z7(d%ׂab6lYYY fM̚q 0wm`ά^f{fvif]dv f72esbVǬjfmd.f+jV0z]xASe'm_'EXcc#uLuW3r1i̴\ ̴\ jfZs53-˜fEs5kyJ˛iN/kfZZLu^Lydv^u[[j͢{fv5f'U<&[X45FWT=6z1 oѭHH^Qڼei|~%/ Wq 'A#8䥐9l<._ 甼WؼeEY3:hyu[ pK!yE+ͫ1]1QHH^Zn*%qӕ<Wؼ'c 0W!yEͫ+i2%CvTWk6юWۼ"ZUƫhS5$º-Hc\?^M.Ǿfiί7VA ߵ?xTʆd ޕMw<ܚ6$wq}{{H7$h 㝉c?ZQ?`*ߡQ} ՉmS ycaLsuz|Vksa"\* ٬HCt+8@i , s, E[, ;<#IRuUꃬc A&5YǖT6o~g+I]yKυvd3;Զmvv;:kwTZzUfB55TG0~|g1ې0, 1999 A lot more unimportant keywords seem to have cropped up when the threshold is set too low. 
Let's increase the count threshold: threshold=80, count=5 http://www.berkeley.edu/student, studentshttp://www.stanford.edu/stanfordhttp://www.mit.e~5l8n8p888`<b<d<f<<<<< ====tuuuuu2v4vyyy$<<<<<=== ===uudufuhuuuuuuu"֊nPܧRnzzpDAA I IIIJJVVVVVVVV(W jvUj@ UVOJQJ5 6OJQJ H*OJQJ0JjuOJQJUjOJQJUOJQJ jnUj@ UV jU jfUB'c3==L%UL%אOjۙiomZca<>ED:xy>̬q}.n88':3'݃xGy }L Aڇ0dN*q}([ӝr|)Ta*ױӎh{φR yFq͹Hg` dP(Եi|7>z5PKarWaii (Sammy) Sy 4893129  HYPERLINK mailto:sammysy@cs.stanford.edu sammysy@cs.stanford.edu June 9th, 2002 CS224n: Project Enhanced Google Problem to be investigated Most Internet search engines, given a search query, try to find web documents containing the terms listed in the query. However, because the search process treats words purely as semantic-less identifiers, these same search engines are unable to distinguish different word senses, not mentioning documents about different topical domains. Frequently, users must manually scan through a long list of interleaved results in order to zoom in onto the desired ones. A solution using an NLP approach One way to solve the problem is to figure out which one of the different word senses that a document assumes. For instance, let's say the term "Michael Jordan" carries two senses, the basketball player and the Berkeley professor. A document containing "Michael Jordan" will use the term to denote either the basketball player or the professor, but probably not both. This is an important property because we can rely on the set of word senses to categorize documents, and thereby solving the above problem. Our goal now has been reduced to defining such a function that maps the set of documents containing W to the set of W's word senses. Like all other NLP problems, however, the gut of this function is completely non-trivial. For example, givenlist of search results, and specifically, it tries to categorize documents based on their semantic similarity.  EMBED Excel.Sheet.8   EMBED Excel.Sheet.8   EMBED Excel.Sheet.8  ^^\^^0_z_*`t`v`x```ajflfnffVkXkjlllttuuPuzz| & F||}|}}}$́B6ąBz|THJ & F h a search query, how do we obtain the set of word senses? One possible solution is to use a dictionary like WordNet, but this approach falls apart quickly as soon as we have to deal with non-standard English terminologies, acronyms, compound nouns, etc. The space of all human-understandable terms is simply too huge. Moreover, word senses associated with a term can change over time, and thus maintaining a database of all sets of word senses would require too much work and human intervention. These problematic issues will be addressed more thoroughly in later sections of this report. My implemented solution involves inserting an extra processing layer between the users and a selected search engine, Google. This layer, which shall be referred to as Project Enhanced Google or PEG for short, will try to categorize search results returned by Google and present the categorized results neatly to the user. PEG's success will be judged by how well it catagorizes web documents, and how much convenience it brings to its users. PEG has a five-stage pipeline: Invoke the parent search engine to obtain a list of "raw" search results. Fetch and purify web documents. Extract features from web documents. Based on the extracted features, cluster the documents into groups by their similarity. Generate HTML for the end-user view. 
Detailed discussion of algorithms/method used Stage 1: Obtain raw search results I picked Google as the parent search engine because I like it myself, and because it allows search requests via a SOAP interface. By using the web service directly, I save the need to parse the usual HTML output manually while at the same time maintain future compatibility with this well defined interface. The plumbing behind this stage is of little NLP interest, and so I will reframe from elaborating further. Do keep in mind though that Google limits each licensee of the web service to 1000 requests per day. The quota is plentiful for obtaining search results, but I can't utilize Google's vast database to, for example, help examine keywords on a word-by-word basis. Stage 2: Fetch and purify web documents The Network API in Java is one of its strong points, so making HTTP requests to fetch web documents is straightforward. Originally, I was fetching documents sequentially (synchronous calls), but as soon as I started downloading 30 or 40 in one go, the total time spent just in this stage would exceed many minutes due to long socket timeouts and slow servers. To establash a soft upperbound in time requirement, I had to multithread this portion of the program to download multiple documents simultaneously, and upgrade to JDK 1.4 which provided an easy way to shorten socket timeouts. Some non-text document sources are deemed unprocessible by PEG. I exclude URLs with certain file extensions like .zip, .doc, .pdf, etc. The purification procedure is more interesting. I decided to convert everything to lower case. It's true that I will lose all information provided by word capitalization, but in return, the problem of data sparseness is alleviated, which I consider a more immediate problem. HTML documents on the web are not what you normally consider "clean" text. They often contain nasty junk like malformed XML syntax, javascript, symbols; basically, it's often the case that you don't find any clean paragraph of English sentences on a page. There's not much I can do about them, but I try my best using various adhoc heuristics, like filtering out tokens without any alphanumeric character, small integers, and stopwords. In addition, I try to purify each token by cropping away outer junk characters (e.g: punctuations, parenthesis). The purified text, judged by my eyes, is generally satisfactory and usable for all intents and purposes. Note that I throw away all information about sentence boundary, although the general word order is undisturbed. I learned that parsing web data is a huge pain in the neck. Stage 3: Extract features Extracting features from web documents turned out to be the core of the project and the most interesting problem from the NLP point of view. A web document contains a rich amount of information, from the text itself, URL, hyperlinks, HTML structure, to images. Being an NLP project, I decided to treat a web document simply as a sequence of words, without exploiting any of the informative HTML structure (e.g: title string) or hyperlinks. The quality of extracted features will suffer slightly, but at the same time the result of the project will reflect unambiguously the (in)effectiveness of NLP algorithms. Instead of using raw text as the feature set, I decided to distill it a bit by extracting keywords. "Keywords" of a document is defined to be a set of tokens that appear abnormally frequently in the document compared to the norm (i.e: normal English). 
Let count(w|language) = { occurrences of word w in the corpus }
Let count(language) = { total number of tokens in the corpus }
prob(w|language) =
    if count(w|language) == 0, return 0.5 / count(language)
    otherwise, return count(w|language) / count(language)
Note that with this simple smoothing, prob(w|language) is no longer a valid probability mass function, but that is inconsequential because only the proportion matters in my calculation. Since count(language) is such a huge number, 0.5 is a decent guess of the number of occurrences of an unknown word.
For each w in doc
    Let ratio = prob(w|doc) / prob(w|language)
    Let count = count(w|doc)
    w is a keyword of doc if ratio >= THRESHOLD_RATIO and count >= THRESHOLD_COUNT
In English, the algorithm says that a word is a keyword if the following two constraints are satisfied:
The word appears abnormally frequently.
The word is repeated at least several times in the document.
The constraints make a lot of intuitive sense and reflect the definition of keywords. The tricky part is deciding what values to assign to the two thresholds. As alluded to in the introduction, I don't have access to a training set (general documents with labeled keywords), so it's impossible to train the threshold values; I had to rely on my own judgement. This lack of a training set would continue to haunt me from many other angles later on. Perhaps I should have placed more emphasis on locating a usable training set from the get-go. It appears that THRESHOLD_RATIO=80 and THRESHOLD_COUNT=4 work decently well. Again, these values were picked without any theoretical backing, but most of the time the keyword selection is surprisingly accurate. Rarely does the algorithm fail disastrously.
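To make the selection rule concrete, here is a minimal Java sketch of the test (PEG itself is written in Java, but this is an illustrative reconstruction rather than the project's actual code); the map parameters and the 0.5 smoothing constant mirror the pseudocode above.

import java.util.Map;

/** Illustrative sketch of the keyword test described above; not PEG's actual source. */
public class KeywordRule {
    static final double THRESHOLD_RATIO = 80.0;
    static final int THRESHOLD_COUNT = 4;

    /** prob(w|language) with the 0.5 smoothing for unknown words. */
    static double probLanguage(String w, Map<String, Long> corpusCounts, long corpusTokens) {
        long c = corpusCounts.getOrDefault(w, 0L);
        return (c == 0 ? 0.5 : (double) c) / corpusTokens;
    }

    /** True if w qualifies as a keyword of a document with the given token counts. */
    static boolean isKeyword(String w, Map<String, Integer> docCounts, int docTokens,
                             Map<String, Long> corpusCounts, long corpusTokens) {
        int count = docCounts.getOrDefault(w, 0);
        double probDoc = (double) count / docTokens;
        double ratio = probDoc / probLanguage(w, corpusCounts, corpusTokens);
        return ratio >= THRESHOLD_RATIO && count >= THRESHOLD_COUNT;
    }

    public static void main(String[] args) {
        Map<String, Integer> docCounts = Map.of("doom", 12, "the", 40);
        Map<String, Long> corpusCounts = Map.of("doom", 9_000L, "the", 14_000_000L);
        System.out.println(isKeyword("doom", docCounts, 800, corpusCounts, 240_000_000L)); // true
        System.out.println(isKeyword("the", docCounts, 800, corpusCounts, 240_000_000L));  // false
    }
}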
For example, the "students" and "berkeley, information" should form a larger cluster. Also, the number of unsorted documents fallen into the {unrelated} cluster is substantial. I mentioned that earlier that I evaluate cluster "compactness" by computing its average internal distance in a clique of documents configuration, how the compactness function is a decreasing function as the number of clusters approaches one. Here's an illustration:  Note that this function doesn't exhibit a "cliff", or a suddent drop, as n approaches 0. Let's search for "doom": query=doom cluster namecluster sizecorrect countcorrect percentage1doom, doom iii, 2002, iii161487.5%2science fiction fantasy horror, planet doom, sites, science fiction11100%3sitebad category4sgi doom, sgi version, version sgixdoom11100%532+32kwb asus tx97-e 430tx.... 2000, pv, pci11100%6armdeu 1.17, armdeu, wadptr 2.3, wadptr11100%7linux, 10%, 2.5%, freebsd11100%8{unrelated/lonely}7 The cluster labeled "site" is a bad category because it doesn't mean anything. It would have been preferred if the "sgi doom" and "32+32kwb" (a benchmarking site) clusters are merged with the main "doom" (the idSoftware game) cluster. PEG was able to single out a cluster for a horror story, and an essay about linux being Microsoft's impending doom.   Related work in the literature There have been several text clustering systems developed in the past. Xerox PARC's Scatter/Gather implemented a similar system, but it didn't try to distill document features (i.e: keywords), and it used a straightforward cosine measure to define inter-document distances. Andrei Broder of DEC had also done some work in defining a inter-document distance function specifically targeted for web document. It used the mathematical notion of "resemblance" and "containment" (symmetric vs asymmetric relationship), the latter of which PEG does not attempt to account for. Instead of using keywords, it used shingles, which are essentially n-grams. Conclusion Because PEG employs Google as the underlying search engine, technically, the search results cannot be any worse than that of raw Google. PEG attempts to extract more information from the list of search results, and specifically, it tries to categorize documents based on their semantic similarity. Keywords extracted have turn out to be an acceptable feature set to identify web documents, and as a bonus, it acts as a quick summary of a web document that users of PEG might find useful.  EMBED Excel.Sheet.8   EMBED Excel.Sheet.8   EMBED Excel.Sheet.8  d$i&innrr\rr0szs*tttvtxtttujzlznzzVXjl & FP|ܑޑ$̕B6ęBz|TTHJNP ̹,zĺp & F h  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~NDz|8:nplnp"\8^` & F h & F hp`D| \2FH.0NP* ܌d$$l $  * \ ^              J L N v      lhp$$$l  "$hjlnHfhjl\<$$l $2 @FB4  $$l $JNP ̥,zĦNDzpz|8:nplnp"\8^`D| & F h & F h| \2FH.0NP**\^܌dܸܠ$$l $JLNv"$hjlhp$$l $jlnHfhjl2\<$$$l 2 @F  B t     4  $$l $     : P R    H h j l b d "Hdd$$$l omZca<>ED:xy>̬q}.n88':3'݃xGy }L Aڇ0dN*q}([ӝr|)Ta*ױӎh{φR yFq͹Hg` dP(Եi|7>z5PDyK sammysy@cs.stanford.eduyK >mailto:sammysy@cs.stanford.eduDd 0  # Bt:PRHh j l b!d!"""""dd$$l $"#"#H#J#N#`#d#h#r#t#x#######$$$$$$T$X$\$T|ߨ$$ljm $\$f$h$l$$$$$$$@%D%H%R%T%X%`%d%h%r%t%x%%%%%%ߌL@l$$ljm $ does the algorithm fail disastrously. The decision of whether a word is a keyword is not entirely binary. Some keywords are naturally more relevant than others in a document. 
So far I have considered keywords that are unigrams, but there is a big problem: a keyword often comes packaged as a bigram, trigram, or an n-gram in general. For example, among many other things, names of people are bigrams or trigrams. Also, for many compound terms, the "worth" of the whole is greater than the sum of its individual constituents; "operating system" means something special when the two words are used together, for instance. Indeed, the problem of finding collocated keywords is real and crucial. How should I extend the algorithm? The first question I asked was whether it is feasible to include more information in my language model. I thought about building a bigram model, which would be huge in size due to the sheer number of unique bigrams expected to be found. If the space of unigrams is barely manageable, then the space of bigrams is certainly unthinkable: the problem of data sparseness with a bigram model would simply balloon to the point of being unusable. In fact, the number of false positives for bigram keywords is too great to justify the (huge!) extra space needed for a bigram model. Not only that, a bigram model still doesn't scale to n-gram keywords in general. Then I started thinking inductively. If I make one plausible assumption, I can collect collocated keywords relatively straightforwardly: an n-gram is a keyword only if all its substrings of length (n-1) are also keywords. For example, if "operating" and "system" have both established themselves as keywords, then "operating system" and "system operating" are both candidates to become bigram keywords. I also enforce an additional constraint that the n-gram must have appeared at least a certain number of times to qualify. This algorithm was inspired by the A-Priori algorithm for computing frequent item sets.
Recursive algorithm: an n-gram w is a keyword if:
    If n == 1, return true if w is a unigram keyword; return false otherwise.
    Otherwise,
        Let left  = w[0, len-2]   // the leading (n-1)-gram
        Let right = w[1, len-1]   // the trailing (n-1)-gram
        Let count = count(w|doc)
        Return true if both left and right are keywords and count >= THRESHOLD_COUNT;
        return false otherwise.
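Rendered in Java, the inductive check might look like the sketch below; it assumes (my assumption, for illustration) that the unigram keywords have already been selected and that per-document n-gram counts are available.

import java.util.*;

/** Illustrative sketch of the A-Priori-style n-gram keyword test; not PEG's actual source. */
public class NGramKeywords {
    static final int THRESHOLD_COUNT = 4;

    private final Set<String> unigramKeywords;          // already-selected unigram keywords
    private final Map<String, Integer> docNgramCounts;  // n-gram -> occurrences in the document

    NGramKeywords(Set<String> unigramKeywords, Map<String, Integer> docNgramCounts) {
        this.unigramKeywords = unigramKeywords;
        this.docNgramCounts = docNgramCounts;
    }

    /** An n-gram (tokens joined by single spaces) is a keyword if both of its
     *  (n-1)-gram substrings are keywords and it repeats often enough in the document. */
    boolean isKeyword(String ngram) {
        String[] tokens = ngram.split(" ");
        if (tokens.length == 1) {
            return unigramKeywords.contains(ngram);
        }
        String left = String.join(" ", Arrays.copyOfRange(tokens, 0, tokens.length - 1));
        String right = String.join(" ", Arrays.copyOfRange(tokens, 1, tokens.length));
        int count = docNgramCounts.getOrDefault(ngram, 0);
        return isKeyword(left) && isKeyword(right) && count >= THRESHOLD_COUNT;
    }
}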
Note that I'm not accounting for the ratio factor that I used in judging unigram keywords. The reason is that prob(n-gram|language) is tough to define with only a unigram-frequency language model. The Naive Bayes assumption, which worked admirably in the word sense disambiguation project, was a complete failure here. Given the immense size of the corpus, prob(unigram|language) was already tiny; applying the Naive Bayes assumption, a series of successive multiplications, led to microscopic prob(n-gram|language), which in turn led to exceedingly large and useless ratio values. By assuming that an n-gram is a keyword only if all its substrings are also keywords, I am oversimplifying the space of n-gram keywords. A serious shortcoming is false negatives on n-gram keywords that contain a common word. For instance, "My Little Lover", a popular band in Japan, will never be collected as a trigram keyword because "my" is a common/stop word, and hence not a keyword by itself. But other than that, many quality compound keywords are successfully assembled from the list of unigram keywords. The last part of the keyword extraction phase is removing duplicated (n-1)-gram keywords that were used to construct n-gram keywords. This step avoids having "natural language" and "language processing" as bigram keywords whenever "natural language processing" is a keyword. The removal algorithm is also inductive in nature (and somewhat complex), but a nice property it has is that if "linux operating system" is a keyword, it is still possible for "linux" to be a keyword by itself, provided it appears on its own often enough.
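The report calls the removal algorithm "somewhat complex" and does not spell it out, so the following is only a guessed sketch of one simple procedure that has the stated property: a shorter keyword survives only if it occurs often enough outside the longer keywords that contain it. The reuse of THRESHOLD_COUNT here is my assumption.

import java.util.*;

/** Hypothetical sketch of the (n-1)-gram de-duplication step; not PEG's actual algorithm. */
public class KeywordDedup {
    static final int THRESHOLD_COUNT = 4;

    /** Drops shorter keywords that only ever appear inside a longer keyword. */
    static Set<String> dedup(Set<String> keywords, Map<String, Integer> docNgramCounts) {
        Set<String> kept = new HashSet<>();
        for (String k : keywords) {
            int standalone = docNgramCounts.getOrDefault(k, 0);
            boolean containedInLonger = false;
            for (String longer : keywords) {
                if (!longer.equals(k) && (" " + longer + " ").contains(" " + k + " ")) {
                    containedInLonger = true;
                    // discount the occurrences of k that sit inside the longer keyword
                    standalone -= docNgramCounts.getOrDefault(longer, 0);
                }
            }
            // e.g. "linux" survives alongside "linux operating system" only if it
            // still appears on its own often enough
            if (!containedInLonger || standalone >= THRESHOLD_COUNT) {
                kept.add(k);
            }
        }
        return kept;
    }
}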
Although I didn't get enough time to experiment with other keyword extraction algorithms, I thought about using other cues to improve the results. For one, it would be nice if each word in the text had been tagged with its part of speech. A general observation is that keywords are most often nouns or compound nouns; or, to look at it from the opposite direction, it is intuitively rare to find adverbs and adjectives as keywords. The presence of part-of-speech tags would allow an additional constraint. Alternatively, a machine-accessible dictionary with part-of-speech listings for each word could be used to approximate this "a keyword is a noun" constraint. Secondly, collocated keywords could be "verified" by consulting Google's search database: if an n-gram appears in a large number of web documents, then it is likely a valid keyword. However, the limited number of Google web service requests ruled this trick out. On well-behaving documents, ones with plenty of well-formed English sentences, the result of keyword extraction is spectacular (more supporting data to follow). However, there are pages out on the web that are entirely Flash-based, image-based, or frame-based. In those cases no keywords will be found, and such documents become featureless. This is a huge problem, as PEG won't be able to do anything about these misbehaving webpages at all.
Stage 4: Automatically cluster web documents by keywords
Given keywords as document features, we can prepare the space of documents for unsupervised clustering. Our goal is two-fold:
Group similar documents together.
Find well-separated clusters.
Each point in this document space is completely and uniquely identified by its set of keywords. This is a non-Euclidean space without the normal notion of n-dimensional coordinates, but a distance (similarity) function between two documents is definable. There are three types of distance functions that I must define:
Keyword-to-keyword
Document-to-document
Cluster-to-cluster
For the first type, a straightforward definition is the string equality test: two keywords are equal if and only if the strings match completely. At first glance this definition seems logical, but unfortunately it doesn't handle n-gram keywords fairly. For instance, "linux" and "linux kernel", which are semantically similar, would be unfairly marked as dissimilar. To deal with n-gram keywords, I must allow fractional/partial similarity. Again, I do not know the theoretically optimal distance function in this case, but here's a logical option:
keywordDistance(A, B)
    Let a be a substring of A
    Let b be a substring of B
        such that a == b and len(a) = len(b) is maximized
    Let intersect = len(a)
    Let union = len(A) + len(B) - intersect
    return intersect / union
This definition resembles the Jaccard measure. Some examples: distance("a b", "a") = 0.5, distance("a b c", "b c d") = 0.5, distance("a b c", "a c") = 0.25. Other, more indirect metrics could be used to compare keywords. For example, I could use WordNet's synonym listing to approximate the conceptual distance between terms representing similar ideas, like how "ship" is close in meaning to "boat". However, WordNet's database was most of the time inadequate for handling non-standard terms (like "Oracle" as a software company) or n-gram keywords. Normally, we could use the Jaccard or cosine measure to compute the document-to-document distance. However, with fractional keyword equality, I have to abandon the Jaccard measure because the set of keywords is no longer a set of distinct objects.
documentDistance(A, B)
    total = 0
    For each <i, j> where 0 <= i < len(A), 0 <= j < len(B)
        total = total + keywordDistance(A(i), B(j))
    return total / (len(A) * len(B))
Essentially, this computes the average keyword-to-keyword distance over every pair of keywords drawn from the two sets. It has the property that documentDistance(A, B) == documentDistance(B, A). The effect of the cluster-to-cluster distance function is more subtle. I've experimented with two definitions. In the first one, I defined the inter-cluster distance to be the distance between the two closest points of the two clusters. The advantage of this definition is that it allows for more "irregularly-shaped" clusters, but it's also less stable, leading to ill-separated clusters. The second definition is based on the average document-to-document distance between each pair of documents in the two clusters. The advantage of this definition is robustness against linking two well-separated large clusters, but the shapes of the clusters tend to be "spherical" (i.e., they like to expand uniformly in all directions). Fortunately, due to the relatively few data points (30-50), both definitions perform about identically.
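The two pseudocode measures translate almost directly into Java. In the sketch below, keywords are token strings joined by single spaces, and the longest common token run is found with a small dynamic program; this is my own illustrative rendering, not PEG's source.

import java.util.List;

/** Illustrative sketch of the keyword-to-keyword and document-to-document measures. */
public class Distances {

    /** Length, in tokens, of the longest common contiguous token run of a and b. */
    static int longestCommonRun(String[] a, String[] b) {
        int best = 0;
        int[][] dp = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                if (a[i - 1].equals(b[j - 1])) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                    best = Math.max(best, dp[i][j]);
                }
            }
        }
        return best;
    }

    /** Jaccard-like keyword similarity: shared token run over the union of tokens. */
    static double keywordDistance(String keywordA, String keywordB) {
        String[] a = keywordA.split(" ");
        String[] b = keywordB.split(" ");
        int intersect = longestCommonRun(a, b);
        int union = a.length + b.length - intersect;
        return (double) intersect / union;
    }

    /** Average pairwise keyword similarity between two keyword lists (symmetric). */
    static double documentDistance(List<String> docA, List<String> docB) {
        double total = 0.0;
        for (String ka : docA) {
            for (String kb : docB) {
                total += keywordDistance(ka, kb);
            }
        }
        return total / (docA.size() * docB.size());
    }
}

For instance, keywordDistance("linux kernel", "linux") evaluates to 0.5, which matches the partial-similarity behavior described above.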
I employ the straightforward hierarchical clustering algorithm to group the documents, but exactly how do you determine the optimal number of clusters? We can compute the quality of a cluster by its "compactness", approximated by its average pairwise document-to-document distance (think: clique). It's a measure of how well-separated clusters are from one another. If we define compact(n) to be the micro-averaged cluster compactness when there are n clusters (as chosen by the hierarchical clustering algorithm), then it can be shown to be an increasing function of n (it decreases as n approaches 1): when n equals the number of documents, compact(n) = 1.0, but as soon as we start grouping different documents together, the overall compactness decreases, approaching 0.0. Naturally, we want clusters to be as tightly packed as possible, but it can also be argued that we prefer as few necessary groups as possible. My hope is that these two competing qualities will single out an optimal number of clusters. I do not know for certain how to model "as few necessary groups as possible", and without a training set, it's tough to guess thresholds. In the end, after taking PEG for a few manual test drives, I decided to use a normal distribution with a mean of 6.0 and a variance of 5.0 (in units of clusters). These values are arbitrary, but nevertheless not completely random. Afterwards, each cluster is labeled by its prominent keywords, where prominence is measured by the number of documents in the cluster containing a particular keyword. I choose the top four most prominent keywords so as to control the length of a cluster label. Regrettably, I'm most dissatisfied with this part of the project, where I have to determine the optimal number of clusters without any reasonable justification. There has to be a better, more logically sound way to compute this function, but unfortunately I ran out of time. An EM approach may be able to group and label clusters simultaneously.
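The report does not state exactly how the compactness curve and the normal prior are combined, so the following is only one plausible reading, sketched under that assumption: score each candidate n by compact(n) weighted by an unnormalized normal density with mean 6.0 and variance 5.0, and keep the n with the highest score.

/** Illustrative sketch: choose the number of clusters by trading off compactness
 *  against a normal prior over cluster counts (mean 6.0, variance 5.0). */
public class ClusterCount {

    /** Unnormalized normal density over the number of clusters n. */
    static double prior(int n) {
        double mean = 6.0, variance = 5.0;
        return Math.exp(-(n - mean) * (n - mean) / (2.0 * variance));
    }

    /** compact[n] holds the micro-averaged compactness for n clusters (index 0 unused). */
    static int chooseNumClusters(double[] compact) {
        int best = 1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int n = 1; n < compact.length; n++) {
            double score = compact[n] * prior(n);
            if (score > bestScore) {
                bestScore = score;
                best = n;
            }
        }
        return best;
    }
}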
Stage 5: Generate HTML output for the end-user view
With the clustered groups, I'm finally ready to generate HTML code for the search results page. This is straightforward, but there are a few features worth mentioning. First, because PEG goes out of its way to download the web documents pointed to by the search results, it is able to show precisely which URLs are unavailable at the moment (e.g. 404 or 403 errors). Secondly, the result page includes the list of keywords associated with each document, which acts as a decent summary of the content ahead.
Results and performance
The problem at hand is too complex for a machine to automatically decide whether a particular cluster configuration is "good". Without a usable training set, tuning the system and experimenting with new ideas quickly became an exhausting task of testing on a few hopefully representative search queries and eyeballing the results manually. Certainly, this is not the best approach to judging an NLP system. We start with the performance of the keyword extraction engine. With the default parameters, I computed the sets of keywords, ordered by importance, for the following sites (as of today):
threshold=80, count=4
http://www.berkeley.edu/: uc, berkeley, campus, sites, student, students
http://www.stanford.edu/: stanford
http://www.mit.edu/: mit
http://www.cnn.com/: u.s, cnn.com, cnn, video
http://www.stanford.edu/class/cs224n/: java, natural language processing, language processing, processing, assignments, assignment, gates, section, programming, please
http://nlp.stanford.edu/~manning/: +1 650, nlp, probabilistic, ling, linguistics, grammar, natural language processing, christopher manning, language processing, manning
If I reduce the ratio threshold to 40, I obtain the following additional keywords:
threshold=40, count=4
http://www.berkeley.edu/: (none)
http://www.stanford.edu/: (none)
http://www.mit.edu/: news
http://www.cnn.com/: sports
http://www.stanford.edu/class/cs224n/: final project
http://nlp.stanford.edu/~manning/: (none)
As you can see, there is not a whole lot of difference despite a significant reduction in the ratio threshold. This is good news because it means good keywords tend to stand out strongly from the norm. Let's reduce the count threshold, while returning the ratio threshold to its original value:
threshold=80, count=3
http://www.berkeley.edu/: uc, berkeley, cal, campus, extension, sites, sessions, student, students, june
http://www.stanford.edu/: alumni, stanford university, stanford
http://www.mit.edu/: mit, events, programs
http://www.cnn.com/: u.s, cnn.com, instyle.com, sci-tech, jagger, cnn, video, die, gas
http://www.stanford.edu/class/cs224n/: ling, java, natural language processing, language processing, processing, directory, assignments, assignment, manning, gates
http://nlp.stanford.edu/~manning/: +1 650, nlp, 2002, syntactic, probabilistic, ling, science linguistics, linguistics, computational, grammar
threshold=80, count=2
http://www.berkeley.edu/: 2002, a-z, sessions uc extension, finder map, uc, undergraduate, berkeley, cal student, web sites, libraries
http://www.stanford.edu/: directories, alumni, stanford university, stanford
http://www.mit.edu/: mit web, search mit, mit, directory, events calendar, faculty staff, tech, campus, offices programs, massachusetts
http://www.cnn.com/: u.s, cnn.com, instyle.com, sci-tech, time.com, 2002, jagger knighted, mick jagger, in-depth, swamp nets
http://www.stanford.edu/class/cs224n/: cs 224n ling 237 natural language processing, eval.pl script, e.g., hyponym, wmhome, jwordnet, stdin", ling, java, computational
http://nlp.stanford.edu/~manning/: +1 650, nlp, 2002, syntactic, probabilistic models, probabilistic, hinrich sch tze, 2001, 2000, 1999
A lot more unimportant keywords seem to have cropped up when the threshold is set too low.
Let's increase the count threshold:
threshold=80, count=5
http://www.berkeley.edu/: student, students
http://www.stanford.edu/: stanford
http://www.mit.edu/: mit
http://www.cnn.com/: cnn, video
http://www.stanford.edu/class/cs224n/: language processing, processing, section, natural
http://nlp.stanford.edu/~manning/: probabilistic, linguistics, natural language processing, christopher manning, language processing, manning, processing, text, language, natural
It seems a count threshold of 5 (or higher) is too restrictive; too many otherwise good and valid keywords are discarded. A systematic method of evaluating clustering results is not immediately obvious, but I can demonstrate the clustering's effectiveness with a few examples.
Let's try "Berkeley": query=Berkeley cluster namecluster sizecorrect countcorrect percentage1students22100%2seti@home, june 2002, 2002, sponsors11100%3library, libraries/ =!"#$%`!cw~Qk +$#GJ*xZhU\5dṕ̠!2[mLcC(IfK$X, )mbdH-dVV{}}~y~ιxb@d~P Ы@cjB&IDi)ԟ'Q8OdYԞ|<a=Sw(͡\ڌfn}/!yt* I2SQrRq/%xIl [Ln_@x4:^SI?6aƮ؝ 2юAR+x}Ss|;"k /Wei٧|XK;8TA־,VӹڧُvG3L+9!RyfA^{ט )^጗R>-]O>nAN 10~v/0VɩťTږxbUx}|}1BЯ֢bij89㪁I{L_SuFb~?婩-4mgፌ?mpMlEN#sQyCoxd(kϴIgV~~fO|`|O@mkj'5LNeTZXy:O]Q13= }s*5Q#]Rp~oa艷=څ=µOϾwᤰև;pXrs_HX+l+N bqIKvl>LQYq\1Zif$S(eY& 3-hxUV*-iW{f\+~ˉا(tg3 P[JW]攩T֨|I-Y MъAN_\٪4ە:`J+a1~?riMϞu)yNog<FFCreaEsQRc`;Xc50X c%2VȘ_^30ûdRja&ƶ3g(cG;.0v]e6c=B5f U4mZу6[(*ZOlv{AP; UTua){& UT |bƧʞH6Up_ଲGREK.,R;M tϐ+*TݑoHm A𗲻RE ؆;춀TǦj 6LQ5 UDz))AVPvK@hgSH4b W1 UϴZ& F# HtmVxE/+> U׶*QD1 eݾMukQT'&Qw1V-e\XOp<ᵛ6vgUr~f%0Ey[H>ZyՕyt7 ; cډu5,P̕llu57ouw1 E x5Okkqn1hf>D-j ӺhYTt7+#Fřr&gaq8SsZYbqyƹřjj _j(L vqP Pݺ78I6Ŋhmy._/#$_3pJCoِ12aO sU>rd.s|.kh7(`!~FJO~ #GJ*Lx͚]lTEs- }AӖ( IMu#//'RY$ׁ҆H?8}[Sm| Gs*904 ʎ۹z;b-xҏ߳RQq4tv3nU<'/[eU+9*mcK3lTؖ{uBME3%tD}|"*4alԵmNL%&Kyޔ~>؃it|Ε:)ʖaee0kaČr kՏ?8~ʥu8{>KurѪDM\SŐc/Z7(d%ׂab6lYYY fM̚q 0wm`ά^f{fvif]dv f72esbVǬjfmd.f+jV0z]xASe'm_'EXcc#uLuW3r1i̴\ ̴\ jfZs53-˜fEs5kyJ˛iN/kfZZLu^Lydv^u[[j͢{fv5f'U<&[X45FWT=6z1 oѭHH^Qڼei|~%/ Wq 'A#8䥐9l<._ 甼WؼeEY3:hyu[ pK!yE+ͫ1]1QHH^Zn*%qӕ<Wؼ'c 0W!yEͫ+i2%CvTWk6юWۼ"ZUƫhS5$º-Hc\?^M.Ǿfiί7VA ߵ?xTʆd ޕMw<ܚ6$wq}{{H7$h 㝉c?ZQ?`*ߡQ} ՉmS ycaLsuz|Vksa"\* ٬HCt+8@i , s, E[, ;<#IRuUꃬc A&5YǖT6o~g+I]yKυvd3;Զmvv;:kwTZzUfB55TG0~|g1ې33100%4library, uc, collections, web11100%5berkeley, information8787.5%6seismological laboratory, earthquake information, earthquake, berkeley11100%7web22100%8{unrelated/lonely}12 Of course, it easy to have high correct pe a search query, how do we obtain the set of word senses? One possible solution is to use a dictionary like WordNet, but this approach falls apart quickly as soon as we have to deal with non-standard English terminologies, acronyms, compound nouns, etc. The space of all human-understandable terms is simply too huge. Moreover, word senses associated with a term can change over time, and thus maintaining a database of all sets of word senses would require too much work and human intervention. These problematic issues will be addressed more thoroughly in later sections of this report. My implemented solution involves inserting an extra processing layer between the users and a selected search engine, Google. This layer, which shall be referred to as Project Enhanced Google or PEG for short, will try to categorize search results returned by Google and present the categorized results neatly to the user. PEG's success will be judged by how well it catagorizes web documents, and how much convenience it brings to its users. PEG has a five-stage pipeline: Invoke the parent search engine to obtain a list of "raw" search results. Fetch and purify web documents. Extract features from web documents. Based on the extracted features, cluster the documents into groups by their similarity. Generate HTML for the end-user view. Detailed discussion of algorithms/method used Stage 1: Obtain raw search results I picked Google as the parent search engine because I like it myself, and because it allows search requests via a SOAP interface. By using the web service directly, I save the need to parse the usual HTML output manually while at the same time maintain future compatibility with this well defined interface. The plumbing behind this stage is of little NLP interest, and so I will reframe from elaborating further. 
Let count(w|language) = { occurrences of word w in the corpus }
Let count(language) = { total number of tokens in the corpus }

prob(w|language) =
    if count(w|language) == 0, return 0.5 / count(language)
    otherwise, return count(w|language) / count(language)

Note that with this simple smoothing, prob(w|language) is no longer a valid probability mass function, but that is inconsequential because only the proportion matters in my calculation. Because count(language) is such a huge number, 0.5 is a decent guess of the number of occurrences of an unknown word.

For each w in doc:
    Let ratio = prob(w|doc) / prob(w|language)
    Let count = count(w|doc)
    w is a keyword of doc if ratio >= THRESHOLD_RATIO and count >= THRESHOLD_COUNT

In English, a word is a keyword if the following two constraints are satisfied:
    The word appears abnormally frequently.
    The word is repeated at least several times in the document.

The constraints make a lot of intuitive sense and reflect the definition of keywords. The tricky part is deciding what values to assign the two thresholds. As alluded to in the introduction, I do not have access to a training set (general documents with labeled keywords), so it is impossible to train the threshold values; I had to rely on my own judgement. This lack of a training set would continue to haunt me from many other angles, and perhaps I should have placed more emphasis on locating a usable training set from the get-go. It appears that THRESHOLD_RATIO=80 and THRESHOLD_COUNT=4 work decently well. Again, these values were picked without any theoretical backing, but most of the time the keyword selection is surprisingly accurate, and rarely does the algorithm fail disastrously.

The decision of whether a word is a keyword is not entirely binary: some keywords are naturally more relevant than others in a document. For example, all else being equal, a word that appears twice as frequently as another is more likely to carry more relevance, and similarly a word with a higher ratio should be considered more relevant. I use a simple scheme to compute this "importance" value:

importance(keyword) = ratio(keyword) * count(keyword)

Because the ratio is a quotient, it can easily overwhelm the count value, so an unknown word may end up grabbing a huge share of importance, which arguably is the correct course of action. By not using raw text as features, I avoid potential problems when comparing two documents with highly skewed lengths. As long as the number of keywords for each document is controlled, comparisons among documents will not run into unnecessary biases towards or against long or short documents. As a protective measure, I keep only the top 10 (MAX_KEYWORDS) most important keywords, which sets a hard ceiling on the size of the feature set.
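To make the selection rule concrete, here is a minimal Java sketch of the scoring step. It is an illustration rather than PEG's actual source: the class name, method signature, and map-based inputs are assumptions.

    import java.util.*;

    public class KeywordScorer {
        static final double THRESHOLD_RATIO = 80.0;
        static final int THRESHOLD_COUNT = 4;
        static final int MAX_KEYWORDS = 10;

        // corpusCounts maps w to count(w|language); corpusTotal is count(language).
        // docCounts maps w to count(w|doc); docTotal is count(doc).
        public static List<String> extractKeywords(Map<String, Long> corpusCounts, long corpusTotal,
                                                   Map<String, Integer> docCounts, int docTotal) {
            final Map<String, Double> importance = new HashMap<String, Double>();
            for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
                String w = e.getKey();
                int count = e.getValue();
                double pDoc = (double) count / docTotal;
                Long c = corpusCounts.get(w);
                // primitive smoothing: pretend an unseen word occurred 0.5 times in the corpus
                double pLang = (c == null || c == 0L) ? 0.5 / corpusTotal : (double) c / corpusTotal;
                double ratio = pDoc / pLang;
                if (ratio >= THRESHOLD_RATIO && count >= THRESHOLD_COUNT) {
                    importance.put(w, ratio * count);   // importance(keyword) = ratio * count
                }
            }
            List<String> keywords = new ArrayList<String>(importance.keySet());
            Collections.sort(keywords, new Comparator<String>() {   // most important first
                public int compare(String a, String b) {
                    return Double.compare(importance.get(b), importance.get(a));
                }
            });
            return keywords.size() > MAX_KEYWORDS ? keywords.subList(0, MAX_KEYWORDS) : keywords;
        }
    }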
So far I have only considered keywords that are unigrams, but there is a big problem: a keyword often comes packaged as a bigram, trigram, or an n-gram in general. For example, names of people are bigrams or trigrams, among many other things. Also, for many compound terms the "worth" of the whole is greater than the sum of its individual constituents; "operating system" means something special when the words are used together, for instance. Indeed, the problem of finding collocated keywords is real and crucial. How should I extend the algorithm?

The first question I asked was whether it is feasible to include more information in my language model. I thought about building a bigram model, which would be huge in size due to the sheer number of unique bigrams expected to be found. If the space of unigrams is barely manageable, then the space of bigrams is certainly unthinkable.
The problem of data sparseness with a bigram model would simply balloon to the point of being unusable. In fact, the number of false positives for bigram keywords is too great to justify the (huge) extra space needed for the bigram model. Not only that, a bigram model still does not scale to n-gram keywords in general.

Then I started thinking inductively. If I make one plausible assumption, I can collect collocated keywords relatively straightforwardly: an n-gram is a keyword only if all its substrings of length (n-1) are also keywords. For example, if "operating" and "system" have both established themselves as keywords, then "operating system" and "system operating" are both candidates to become bigram keywords. I enforce an additional constraint that the n-gram keyword must also have appeared at least a certain number of times to qualify. (This algorithm was inspired by the A-Priori algorithm for computing frequent item sets.)

Recursive algorithm: an n-gram w is a keyword if:
    If n == 1, return true if w is a unigram keyword; return false otherwise.
    Otherwise,
        Let left = w[0, len-2]    // substring
        Let right = w[1, len-1]
        Let count = count(w|doc)
        Return true if both left and right are keywords and count >= THRESHOLD_COUNT;
        return false otherwise.

Note that I am not accounting for the ratio factor that I used in judging unigram keywords. The reason is that prob(n-gram|language) is tough to define with just a unigram frequency language model. The Naive Bayes assumption, which worked admirably in the word sense disambiguation project, was a complete failure here: given the immense size of the corpus, prob(unigram|language) was already tiny, and applying the Naive Bayes assumption, a series of successive multiplications, led to microscopic prob(n-gram|language), which in turn led to exceedingly large and useless ratio values.
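The recursive rule translates almost directly into code. The following Java sketch assumes the unigram keywords have already been extracted and that n-gram counts for the document are available in a map keyed by the space-joined n-gram; the names here are illustrative, not PEG's.

    import java.util.*;

    public class NgramKeywordCheck {
        static final int THRESHOLD_COUNT = 4;

        // An n-gram (given as an array of tokens) is a keyword if both of its (n-1)-gram
        // substrings are keywords and the n-gram itself repeats often enough in the document.
        public static boolean isKeyword(String[] w, Set<String> unigramKeywords,
                                        Map<String, Integer> docNgramCounts) {
            if (w.length == 1) {
                return unigramKeywords.contains(w[0]);
            }
            String[] left = Arrays.copyOfRange(w, 0, w.length - 1);   // w[0 .. len-2]
            String[] right = Arrays.copyOfRange(w, 1, w.length);      // w[1 .. len-1]
            Integer count = docNgramCounts.get(join(w));              // count(w|doc)
            return isKeyword(left, unigramKeywords, docNgramCounts)
                    && isKeyword(right, unigramKeywords, docNgramCounts)
                    && count != null && count >= THRESHOLD_COUNT;
        }

        private static String join(String[] tokens) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < tokens.length; i++) {
                if (i > 0) sb.append(' ');
                sb.append(tokens[i]);
            }
            return sb.toString();
        }
    }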
For instance, "My Little Lover", a popular band in Japan, will never be collected as a trigram keyword because "my" is a common/stop ::T:X:\:f:h:l:::::::@;D;H;R;T;X;`;d;h;r;t;x;;ߌL@l$$ljm $;;;;;;??AAABBBBBBCC:C`CbCfCCCCߤ$$ljm $#GJ*xZhU\5dṕ̠!2[mLcC(IfK$X, )mbdH-dVV{}}~y~ιxb@d~P Ы@cjB&IDi)ԟ'Q8OdYԞ|<a=Sw(͡\ڌfn}/!yt* I2SQrRq/%xIl [Ln_@x4:^SI?6aƮ؝ 2юAR+x}Ss|;"k /Wei٧|XK;8TA־,VӹڧُvG3L+9!RyfA^{ט )^጗R>-]O>nAN 10~v/0VɩťTږxbUx}|}1BЯ֢bij89㪁I{L_SuFb~?婩-4mgፌ?mpMlEN#sQyCoxd(kϴIgV~~fO|`|O@mkj'5LNeTZXy:O]Q13= }s*5Q#]Rp~oa艷=څ=µOϾwᤰև;pXrs_HX+l+N bqF@FBFDFFFI IIIIIߘh$$ljm $<ܚ6$wq}{{H7$h 㝉c?ZQ?`*ߡQ} ՉmS ycaLsuz|Vksa"\* ٬HCt+8@i , s, E[, ;<#IRuUꃬc A&5YǖT6o~g+I]yKυvd3;Զmvv;:kwTZzUfB55TG0~|g1ې'c3==L%UL%אOjۙiomZca<>ED:xy>̬q}.n88':3'݃xGy }L Aڇ0dN*q}([ӝr|)Ta*ױӎh{φR yFq͹Hg` dP(Եi|7>z5PDword, and hence not a keyword by itself. But other than that, many quality compound keywords are successfully assembled from the list of unigram keywords. The last part of the keyword extraction phase is removing duplicated (n-1)-gram keywords that have been used to construct n-gram keywords. This step is to avoid having "natural language" and "language processing" as bigram keywords whenever "natural language processing" is a keyword. The removal algorithm is also inductive in nature (and somewhat complex), but a nice property it has is that if "linux operating system" is a keyword, it is still possible for "linux" to be a keyword by itself provided it's appeared by itself enough. Although I didn't get enough time to experiment with other keyword extraction algorithms, I thought about using other cues to improve results. For one, it would be nice if each word in the text had been tagged for its part of speech. A general observation is that keywords are most often nouns or compound nouns. Or to think of this from the opposite direction, it is rare to find adverbs and adjectives as keywords, intuitively. The presence of part of speech tags would allow an additional constraint. Actually, an machine-accessible dictionary with part of speech listing for a word could be used to approximate this "keyword is a noun" constraint as well. Secondly, collocated keywords maybe "verified" by consulting Google's search database. If an n-gram is found to appear in a large number of web documents, then likely such an n-gram is a valid keyword. However, the limited number of Google web service requests disallowed this trick. On well-behaving documents, ones with plenty of well-formed English sentences, the result of keyword extraction is spectacular (more supporting data to follow). However, there are pages out on the web that are entirely Flash-based, image-based, or frame-based. In those cases, no keywords will be found, and such a document will become featureless. This is a huge problem, as PEG won't be able to do anything about these misbehaving webpages at all. Stage 4: Automatically cluster web documents by keywords Given keywords as document features, we can prepare the space of documents for unsupervised clustering. Our goal is two-fold: Group similar documents together Find well-separated clusters Each point in this document space is and completely and uniquely identified by its set of keywords. This is a non-Euclidean space without the normal notion of n-dimensional coordinates, but a distance (similarity) function between two documents is definable. 
Although I did not have enough time to experiment with other keyword extraction algorithms, I thought about using other cues to improve the results. For one, it would be nice if each word in the text had been tagged with its part of speech. A general observation is that keywords are most often nouns or compound nouns; from the opposite direction, it is intuitively rare to find adverbs and adjectives as keywords. The presence of part-of-speech tags would allow an additional constraint, and a machine-accessible dictionary with part-of-speech listings could be used to approximate this "keyword is a noun" constraint as well. Secondly, collocated keywords might be "verified" by consulting Google's search database: if an n-gram appears in a large number of web documents, then it is likely a valid keyword. However, the limited number of Google web service requests disallowed this trick.

On well-behaving documents, ones with plenty of well-formed English sentences, the result of keyword extraction is spectacular (more supporting data to follow). However, there are pages out on the web that are entirely Flash-based, image-based, or frame-based. In those cases no keywords will be found, and such a document becomes featureless. This is a huge problem, as PEG won't be able to do anything about these misbehaving web pages at all.

Stage 4: Automatically cluster web documents by keywords

Given keywords as document features, we can prepare the space of documents for unsupervised clustering. The goal is two-fold: group similar documents together, and find well-separated clusters. Each point in this document space is completely and uniquely identified by its set of keywords. This is a non-Euclidean space without the normal notion of n-dimensional coordinates, but a distance (similarity) function between two documents is definable.

There are three types of distance functions that I must define:
    Keyword-to-keyword
    Document-to-document
    Cluster-to-cluster

For the first type, a straightforward definition is the string equality test: two keywords are equal if and only if their strings match completely. At first glance this definition seems logical, but unfortunately it does not handle n-gram keywords fairly. For instance, "linux" and "linux kernel", which are semantically similar, would be unfairly marked as dissimilar. To deal with n-gram keywords, I must allow fractional/partial similarity. Again, I do not know the theoretically optimal distance function in this case, but here is a logical option:

keywordDistance(A, B):
    Let a be a substring of A and b be a substring of B such that a == b and len(a) = len(b) is maximized
    Let intersect = len(a)
    Let union = len(A) + len(B) - intersect
    return intersect / union

This definition resembles the Jaccard measure. Some examples: distance("a b", "a") = 0.5, distance("a b c", "b c d") = 0.5, distance("a b c", "a c") = 0.33. Other, more indirect metrics could be used to compare keywords. For example, I could use WordNet's synonym listing to approximate the conceptual distance between terms representing similar ideas, like how "ship" is close in meaning to "boat". However, WordNet's database was most of the time inadequate for handling non-standard terms, like "Oracle" as a software company, or n-gram keywords.

Normally, we could use the Jaccard or cosine measure to compute the document-to-document distance function. However, with fractional keyword equality, I have to abandon the Jaccard measure because the set of keywords is no longer a set of distinct objects.

documentDistance(A, B):
    total = 0
    For each <i, j> where 0 <= i < len(A), 0 <= j < len(B):
        total = total + keywordDistance(A(i), B(j))
    return total / (len(A) * len(B))

Essentially, this computes the average keyword-to-keyword distance over all pairs of keywords in the two sets. It has the property that documentDistance(A, B) == documentDistance(B, A).
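The following Java sketch renders the two pseudocode definitions above under the reading that len() counts tokens and a "substring" is a contiguous run of tokens; it reproduces the first two worked examples. As with the other sketches, the class and method names are invented for illustration.

    import java.util.*;

    public class KeywordSimilarity {
        // Jaccard-like overlap between two keywords treated as token sequences.
        // Despite being called a "distance", a larger value means MORE similar.
        public static double keywordDistance(String a, String b) {
            String[] x = a.split(" ");
            String[] y = b.split(" ");
            int intersect = 0;                      // longest common contiguous run of tokens
            for (int i = 0; i < x.length; i++) {
                for (int j = 0; j < y.length; j++) {
                    int k = 0;
                    while (i + k < x.length && j + k < y.length && x[i + k].equals(y[j + k])) k++;
                    intersect = Math.max(intersect, k);
                }
            }
            int union = x.length + y.length - intersect;
            return union == 0 ? 0.0 : (double) intersect / union;
        }

        // Average pairwise keyword overlap between two documents' keyword sets.
        public static double documentDistance(List<String> a, List<String> b) {
            if (a.isEmpty() || b.isEmpty()) return 0.0;
            double total = 0.0;
            for (String ka : a)
                for (String kb : b)
                    total += keywordDistance(ka, kb);
            return total / (a.size() * b.size());
        }

        public static void main(String[] args) {
            System.out.println(keywordDistance("a b", "a"));               // 0.5
            System.out.println(keywordDistance("a b c", "b c d"));         // 0.5
            System.out.println(keywordDistance("linux", "linux kernel"));  // 0.5 rather than 0
        }
    }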
For instance, "My Little Lover", a popular band in Japan, will never be collected as a trigram keyword because "my" is a common/stop word, and hence not a keyword by itself. But other than that, many quality compound keywords are successfully assembled from the list of unigram keywords. The last part of the keyword extraction phase is removing duplicated (n-1)-gram keywords that have ٬HCt+8@i , s, E[, ;<#IRuUꃬc A&5YǖT6o~g+I]yKυvd3;Զmvv;:kwTZzUfB55TG0~|g1ې'c3==L%UL%אOjۙiomZca<>ED:xy>̬q}.n88':3'݃xGy }L Aڇ0dN*q}([ӝr|)Ta*ױӎh{φR yFq͹Hg` dP(Եi|7>z5PDyK sammysy@cs.stanford.eduyK >mailto:sammysy@cs.stanford.eduDd 0  # A2cw~Qk +2 `!cw~Qk +$#GJ*xZhU\5dṕ̠!2[mLcC(IfK$X, )mbdH-dVV{}}~y~ιxb@d~P Ы@cjB&IDi)ԟ'Q8OdYԞ|<a=Sw(͡\ڌfn}/}}~y~ιxb@d~P Ы@cjB&IDi)ԟ'Q8OdYԞ|<a=Sw(͡\ڌfn}/!yt* I2SQrRq/%xIl [Ln_@x4:^SI?6aƮ؝ 2юAR+x}Ss|;"k /Wei٧|XK;8TA־,VӹڧُvG3L+9!RyfA^{ט )^጗R>-]O>nAN 10~v/0VɩťTږxbUx}|}1BЯ֢bij89㪁I{L_SuFb~?婩-4mgፌ?mpMlEN#sQyCoxd(kϴIgV~~fO|`|O@mkj'5LNeTZXy:O]Q13= }s*5Q#]Rp~oa艷=څ=µOϾwᤰև;pXrs_HX+l+N bq where 0 <= i < len(A), 0 <= j < len(B) total = total + keywordDistance(A(i), B(j)) return total / (len(A) * len(B)) Essentially, it's computing the average word-to-word distance for each pairs of keywords in the two sets. It has the property that documentDistance(A,B) == documentDistance(B,A). The effect of the Cluster-to-Cluster distance function is more subtle. I've experimented with two definitions. In the first one, I defined the inter-cluster instance to be the distance of the two closest points between the two clusters. The advantage of this definition is that it allows for more "irregularly-shaped" clusters, but it's also less stable, leading to ill-separated clusters. The second definition is based on the average document-to-document distance between each pair of documents in the two clusters. The advantage of this definiton is robustness against linking two well separated large clusters, but the shapes of the clusters tend to be "spherical" (i.e: likes to expand uniformly in all direction). Fortunately, due to the relatively few data points (30-50), both definitions perform about identically. I employ the straightforward Hierarchical Clustering algorithm to group the documents, but exactly how do you determine the optimal number of clusters? We can compute the quality of a cluster by its "compactness", approximated by its average pair-wise document-to-document distances (think: clique). It's a measure of how well-separated clusters are to one another. If we define compact(n) to be the micro-average cluster compactness for having n clusters (chosen by the hierachical clustering algorithm), then it can be proven that it's an increasing function (decreases as n approaches 1); when n == number of dyK 7http://cutevoice.stanford.edu/~sammysy/google/src/web/yK nhttp://cutevoice.stanford.edu/~sammysy/google/src/web/QDyK 7http://cutevoice.stanford.edu/~sammysy/google/s>̬q}.n88':3'݃xGy }L Aڇ0dN*q}([ӝr|)Ta*ױӎh{φR yFq͹Hg` dP(Եi|7>z5PQDyK 7http://cutevoice.stanford.edu/~sammysy/google/src/web/yK nhttp://cutevoice.stanford.edu/~sammysy/google/src/web/QD d[2O#3]3GdHņFc? 
Naturally, we want clusters to be as tightly packed as possible, but it can also be argued that we prefer as few groups as necessary. My hope is that these two competing qualities will single out an optimal number of clusters. I do not know for certain how to model "as few necessary groups as possible", and without a training set it is tough to guess thresholds. In the end, after taking PEG for a few manual test drives, I decided to use a normal distribution with a mean of 6.0 and a variance of 5.0 (in units of clusters). These values are arbitrary, but nevertheless not completely random. Afterwards, each cluster is labeled by its prominent keywords, where prominence is measured by the number of documents in the cluster containing a particular keyword. I choose the top four most prominent keywords in order to control the length of a cluster label. Regrettably, I am most dissatisfied with this part of the project, where I have to determine the optimal number of clusters without any reasonable justification. There has to be a better, more logically sound way to compute this function, but unfortunately I ran out of time. An EM approach may be able to group and label clusters simultaneously.

Stage 5: Generate HTML output for the end-user view

With the clustered groups, I am finally ready to generate HTML code for the search results page. This is straightforward, but there are a few features worth mentioning. First, because PEG goes out of its way to download the web documents pointed to by the search results, it is able to show precisely which URLs are unavailable at the moment (e.g., 404 and 403 errors). Secondly, the result page includes the list of keywords associated with each document, which acts as a decent summary of the content ahead.
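For completeness, here is a stripped-down Java sketch of the kind of per-cluster rendering described above; the real page also marks unavailable URLs, and the method and data layout here are invented for the example.

    import java.util.*;

    public class ResultPage {
        // Render one cluster: the cluster label, then each document's title, URL,
        // and extracted keywords (which double as a quick summary).
        public static String renderCluster(String label, List<String[]> docs) {
            StringBuilder html = new StringBuilder();
            html.append("<h3>").append(escape(label)).append("</h3>\n<ul>\n");
            for (String[] doc : docs) {            // doc = { title, url, comma-separated keywords }
                html.append("  <li><a href=\"").append(doc[1]).append("\">")
                    .append(escape(doc[0])).append("</a> ")
                    .append("<i>").append(escape(doc[2])).append("</i></li>\n");
            }
            html.append("</ul>\n");
            return html.toString();
        }

        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }
    }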
Results and performance

The problem at hand is too complex for a machine to automatically decide whether a particular cluster configuration is "good". Without a usable training set, tuning the system and experimenting with new ideas quickly became an exhausting exercise of testing a few hopefully representative search queries and eyeballing the results manually. Certainly, this is not the best way to judge an NLP system.

We start with the performance of the keyword extraction engine. With the default parameters, I computed the sets of keywords, ordered by importance, for the following sites (as of today):

threshold=80, count=4
    http://www.berkeley.edu/ : uc, berkeley, campus, sites, student, students
    http://www.stanford.edu/ : stanford
    http://www.mit.edu/ : mit
    http://www.cnn.com/ : u.s, cnn.com, cnn, video
    http://www.stanford.edu/class/cs224n/ : java, natural language processing, language processing, processing, assignments, assignment, gates, section, programming, please
    http://nlp.stanford.edu/~manning/ : +1 650, nlp, probabilistic, ling, linguistics, grammar, natural language processing, christopher manning, language processing, manning

If I reduce the ratio threshold to 40, I obtain the following additional keywords:

threshold=40, count=4
    http://www.berkeley.edu/ : (none)
    http://www.stanford.edu/ : (none)
    http://www.mit.edu/ : news
    http://www.cnn.com/ : sports
    http://www.stanford.edu/class/cs224n/ : final project
    http://nlp.stanford.edu/~manning/ : (none)

As you can see, there is not a whole lot of difference despite a significant reduction in the ratio threshold. This is good news, because it means good keywords tend to be good at standing out from the norm. Let's reduce the count threshold instead, while returning the ratio threshold to its original value:

threshold=80, count=3
    http://www.berkeley.edu/ : uc, berkeley, cal, campus, extension, sites, sessions, student, students, june
    http://www.stanford.edu/ : alumni, stanford university, stanford
    http://www.mit.edu/ : mit, events, programs
    http://www.cnn.com/ : u.s, cnn.com, instyle.com, sci-tech, jagger, cnn, video, die, gas
    http://www.stanford.edu/class/cs224n/ : ling, java, natural language processing, language processing, processing, directory, assignments, assignment, manning, gates
    http://nlp.stanford.edu/~manning/ : +1 650, nlp, 2002, syntactic, probabilistic, ling, science linguistics, linguistics, computational, grammar

threshold=80, count=2
    http://www.berkeley.edu/ : 2002, a-z, sessions uc extension, finder map, uc, undergraduate, berkeley, cal student, web sites, libraries
    http://www.stanford.edu/ : directories, alumni, stanford university, stanford
    http://www.mit.edu/ : mit web, search mit, mit, directory, events calendar, faculty staff, tech, campus, offices programs, massachusetts
    http://www.cnn.com/ : u.s, cnn.com, instyle.com, sci-tech, time.com, 2002, jagger knighted, mick jagger, in-depth, swamp nets
    http://www.stanford.edu/class/cs224n/ : cs 224n ling 237 natural language processing, eval.pl script, e.g., hyponym, wmhome, jwordnet, stdin", ling, java, computational
    http://nlp.stanford.edu/~manning/ : +1 650, nlp, 2002, syntactic, probabilistic models, probabilistic, hinrich sch tze, 2001, 2000, 1999

A lot more unimportant keywords seem to crop up when the count threshold is set too low.
Let's increase the count threshold:

threshold=80, count=5
    http://www.berkeley.edu/ : student, students
    http://www.stanford.edu/ : stanford
    http://www.mit.edu/ : mit
    http://www.cnn.com/ : cnn, video
    http://www.stanford.edu/class/cs224n/ : language processing, processing, section, natural
    http://nlp.stanford.edu/~manning/ : probabilistic, linguistics, natural language processing, christopher manning, language processing, manning, processing, text, language, natural

It seems a count threshold of 5 (or higher) is too restrictive: too many otherwise good and valid keywords are discarded.

A systematic method of evaluating the clustering results is not immediately obvious, but I can demonstrate the system's effectiveness with a few examples.
Let's try "Berkeley": query=Berkeley cluster namecluster sizecorrect countcorrect percentage1students22100%2seti@home, june 2002, 2002, sponsors11100%3library, libraries33100%4library, uc, collections, web11100%5berkeley, information8787.5%6seismological laboratory, earthquake information, earthquake, berkeley11100%7web22100%8{unrelated/lonely}12 Of course, it easy to have high correct pe/ =!"#$%`!cw~Qk +$#GJ*xZhU\5dṕ̠!2[mLcC(IfK$X, )mbdH-dVV{}}~y~ιxb@d~P Ы@cjB&IDi)ԟ'Q8OdYԞ|<a=Sw(͡\ڌfn}/!yt* I2SQrRq/%xIl [Ln_@x4:^SI?6aƮ؝ 2юAR+x}Ss|;"k /Wei٧|XK;8TA־,VӹڧُvG3L+9!RyfA^{ט )^጗R>-]O>nAN 10~v/0VɩťTږxbUx}|}1BЯ֢bij89㪁I{L_SuFb~?婩-4mgፌ?mpMlEN#sQyCoxd(kϴIgV~~fO|`|O@mkj'5LNeTZXy:O]Q13= }s*5Q#]Rp~oa艷=څ=µOϾwᤰև;pXrs_HX+l+N bqIKvl>LQYq\1Zif$S(eY& 3-hxUV*-iW{f\+~ˉا(tg3 P[JW]攩T֨|I-Y MъAN_\٪4ە:`J+a1~?riMϞu)yNog<FFCreaEsQRc`;Xc50X c%2VȘ_^30ûdRja&ƶ3g(cG;.0v]e6c=B5f U4mZу6[(*ZOlv{AP; UTua){& UT |bƧʞH6Up_ଲGREK.,R;M tϐ+*TݑoHm A𗲻RE ؆;춀TǦj 6LQ5 UDz))AVPvK@hgSH4b W1 UϴZ& F# HtmVxE/+> U׶*QD1 eݾMukQT'&Qw1V-e\XOp<ᵛ6vgUr~f%0Ey[H>ZyՕyt7 ; cډu5,P̕llu57ouw1 E x5Okkqn1hf>D-j ӺhYTt7+#Fřr&gaq8SsZYbqyƹřjj _j(L vqP Pݺ78I6Ŋhmy._/#$_3pJCoِ12aO sU>rd.s|.kh7(`!~FJO~ #GJ*Lx͚]lTEs- }AӖ( IMu#//'RY$ׁ҆H?8}[Sm| Gs*904 ʎ۹z;b-xҏ߳RQq4tv3nU<'/[eU+9*mcK3lTؖ{uBME3%tD}|"*4alԵmNL%&Kyޔ~>؃it|Ε:)ʖaee0kaČr kՏ?8~ʥu8{>KurѪDM\SŐc/Z7(d%ׂab6lYYY fM̚q 0wm`ά^f{fvif]dv f72esbVǬjfmd.f+jV0z]xASe'm_'EXcc#uLuW3r1i̴\ ̴\ jfZs53-˜fEs5kyJ˛iN/kfZZLu^Lydv^u[[j͢{fv5f'U<&[X45FWT=6z1 oѭHH^Qڼei|~%/ Wq 'A#8䥐9l<._ 甼WؼeEY3:hyu[ pK!yE+ͫ1]1QHH^Zn*%qӕ<Wؼ'c 0W!yEͫ+i2%CvTWk6юWۼ"ZUƫhS5$º-Hc\?^M.Ǿfiί7VA ߵ?xTʆd ޕMw<ܚ6$wq}{{H7$h 㝉c?ZQ?`*ߡQ} ՉmS ycaLsuz|Vksa"\* ٬HCt+8@i , s, E[, ;<#IRuUꃬc A&5YǖT6o~g+I]yKυvd3;Զmvv;:kwTZzUfB55TG0~|g1ېrcentage if cluster size is small and the cluster group general enough. The documents in the {unrelated/lonely} cluster are either featureless, or unrelated to all other documents in the pool. 
Unfortunately, although the quality of the individual clusters is decent, we would have preferred fewer clusters. For example, the "students" and "berkeley, information" clusters should form one larger cluster. Also, the number of unsorted documents that fall into the {unrelated/lonely} cluster is substantial. I mentioned earlier that I evaluate cluster "compactness" by computing the average internal distance in a clique-of-documents configuration, and that the compactness function decreases as the number of clusters approaches one.
Here's an illustration:

[embedded chart: compactness as a function of the number of clusters]

Note that the function does not exhibit a "cliff", or a sudden drop, as n approaches 1.

Let's search for "doom":

query=doom
    cluster | cluster name | cluster size | correct count | correct percentage
    1 | doom, doom iii, 2002, iii | 16 | 14 | 87.5%
    2 | science fiction fantasy horror, planet doom, sites, science fiction | 1 | 1 | 100%
    3 | site | bad category | |
    4 | sgi doom, sgi version, version sgixdoom | 1 | 1 | 100%
    5 | 32+32kwb asus tx97-e 430tx.... 2000, pv, pci | 1 | 1 | 100%
    6 | armdeu 1.17, armdeu, wadptr 2.3, wadptr | 1 | 1 | 100%
    7 | linux, 10%, 2.5%, freebsd | 1 | 1 | 100%
    8 | {unrelated/lonely} | 7 | |

The cluster labeled "site" is a bad category because the label doesn't mean anything. It would have been preferable for the "sgi doom" and "32+32kwb" (a benchmarking site) clusters to be merged with the main "doom" (the idSoftware game) cluster. On the other hand, PEG was able to single out a cluster for a horror story, and another for an essay about Linux being Microsoft's impending doom.

Related work in the literature

There have been several text clustering systems developed in the past. Xerox PARC's Scatter/Gather implemented a similar system, but it did not try to distill document features (i.e., keywords), and it used a straightforward cosine measure to define inter-document distances. (http://www2.parc.com/istl/projects/ia/papers/sg-sigir92/sigir92.html)

Andrei Broder of DEC has also done work on defining an inter-document distance function specifically targeted at web documents. It used the mathematical notions of "resemblance" and "containment" (a symmetric versus an asymmetric relationship), the latter of which PEG does not attempt to account for. Instead of keywords, it used shingles, which are essentially n-grams. (http://www.scope.gmd.de/info/www6/technical/paper205/paper205.html)

The "WiseNut. Search. Exactly" search engine seems to perform web document categorization as well, but unfortunately no technical information describing the algorithms used is present on the website. (http://www.wisenut.com/)

Conclusion

Because PEG employs Google as the underlying search engine, the search results technically cannot be any worse than raw Google's. PEG attempts to extract more information from the list of search results; specifically, it tries to categorize documents based on their semantic similarity. The extracted keywords have turned out to be an acceptable feature set for identifying web documents, and as a bonus, they act as a quick summary of a web document that users of PEG might find useful. The clustering algorithm is definitely suboptimal, and more work on defining better and more semantically sensitive distance/similarity functions would certainly improve the performance of the system substantially. And yeah, NLP is hard.

A running demo can be found at http://cutevoice.stanford.edu/~sammysy/google/src/web/.
This site will only last until the 12th, when I move out of campus housing.