Brendan O'Connor and Michael Bieniosek
Linguistics 138; December 6, 2002
Prof. Martin Kay

Final Project: Random Sentence Generation with n-grams

We developed a random sentence generator that trained itself on a corpus of natural written language and could then generate sentences based upon that corpus. It worked by doing statistical analysis of trigrams, or more generally n-gram tuples, of sequential words. When it generated words, it would pick the next word in the sentence by choosing a word that in the corpus often came after the previous n-1 words. This was initially implemented by using a Python tuple of the first n-1 words as a key in a global dictionary whose value was a probability table of all the words which had followed those sequences in the corpus, with their respective frequencies.

Imagine we were to parse the following sentence into triples (for the purposes of this example, suppose the sentence appears entirely in lowercase):

    the brown dog jumps over the brown fox

In the original tuple-keyed implementation, each initial word pair would be used as a key pointing to all possible succeeding words. The key is simply a Python tuple of strings.

    (*, *)        -> the=1
    (*, the)      -> brown=1
    (the, brown)  -> dog=.5, fox=.5
    (brown, dog)  -> jumps=1
    (dog, jumps)  -> over=1
    (jumps, over) -> the=1
    (over, the)   -> brown=1
    (brown, fox)  -> .=1

However, this approach was more useful when n was small. A different implementation was used in order to generalize the algorithm for n-grams of n > 3. A tree of the initial n-1 words was used to point to the frequency tables in the data structure. Thus, the representation of the training corpus consisted of a dictionary of possible initial words for each sequence, with each word pointing to a list of possible following words, and so on until reaching the frequency table at the (n-1)th word.

In the tree implementation, instead of using tuples for keys, the first two words are made into a tree of possible word sequences. This tree contains every double in the text, and each double points to a list of all words which follow it (in this example some initial words have been omitted):

          *                 the             brown         (etc.)
       ___|___               |            ____|____
      |       |              |           |         |
      *      the           brown        dog       fox
      |       |              |           |         |
   (the=1) (brown=1)  (dog=.5, fox=.5) (jumps=1)  (.=1)

This makes it easier to implement random sentence generation for n-grams of arbitrary length. It also permits analysis of the tree structure, since a procedure could simply walk through the tree and gather statistics. We did a version of this for sparseness analysis; see below.
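For concreteness, the tuple-keyed scheme for n = 3 can be sketched as below. This is an illustrative reconstruction rather than the project code (which appears under Source Code at the end and uses the tree representation); the names build_table and next_word exist only for this example.

    import random

    def build_table(words, n=3):
        """Map each (n-1)-word context tuple to a frequency table of next words."""
        table = {}
        context = ('*',) * (n - 1)
        for w in words:
            freqs = table.setdefault(context, {})
            freqs[w] = freqs.get(w, 0) + 1
            context = context[1:] + (w,)
        return table

    def next_word(table, context):
        """Pick a follower of `context` with probability proportional to its frequency."""
        freqs = table[context]
        r = random.uniform(0, sum(freqs.values()))
        for w, count in freqs.items():
            r -= count
            if r <= 0:
                return w

    # the example sentence, with the sentence-final period as its own token
    words = "the brown dog jumps over the brown fox .".split()
    table = build_table(words)
    print table[('the', 'brown')]              # frequency table, e.g. {'dog': 1, 'fox': 1}
    print next_word(table, ('the', 'brown'))   # 'dog' or 'fox', equally often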
FINDINGS

1) Surprising effectiveness

We discovered that the random sentences had remarkably grammatical syntax, considering the manner in which they were generated. For example, one sentence generated from digrams was:

"We have been nebulous as it ran towards the Istiqlal's 'privileged position' until it was speaking, metaphorical, but have model ~R Dietetic Laboratories, which might do everything he was loud in frankfurter in the prelude to be."

While this sentence has no real semantic meaning, its syntax is perfect. Subject-verb agreement in naturally produced language usually does not seem to require more than two words of context: the verb and the word immediately preceding it, i.e., the noun. This suggests that English does not have very many syntactic long-distance dependencies and can be used by speakers who only remember the last few words they have spoken. Sentences that do have long-distance dependencies for subject-verb agreement, such as "The man that they saw when buying the pastries was walking to the bus stop", seem to be somewhat uncommon.

The generator did seem to have some problems with putting exactly one verb in each sentence, though it tended to produce sentences without verbs more often than sentences with multiple verbs. One would expect this, since many noun phrases can appear either as a subject or as an object in different sentences; the generator merely forgot where it was and tried to make the noun phrase do both.

It is not clear how other languages would behave under this method of sentence generation. French would probably do only slightly worse. Like English, French verbs can only be separated from their subject by a small number of words. Gender agreement might also rest mostly on short-distance dependencies and so would not pose too much trouble for the generator. A language which required case agreement with its verb might have more trouble forming correct syntax, especially when the verb does not immediately precede or follow the noun phrase. A language like Latin would probably not work so well, simply because the generator would be confused by the different cases of words.

2) Sparseness

Using the Brown corpus as a training text, the generator did well up to 3- and 4-grams. However, sparse data became a problem after that, with most 5-gram sentences taken directly from the training text. The problem was that there were not enough common n-1 tuples shared between multiple sentences, so when the generator was walking through the tree, it had only one choice at each probability table.

We extended the corpus analyzer to record metadata about where each word came from and to save it within its probability table. We were then able to analyze generated sentences to see which corpus sentences each n-gram came from, and where the combinations between different sentences took place. Here is sample formatted output:

------------------------------------------------------------
He has served in the Kremlin this morning to be honest and morally proper.
? b
He      : n/a
has     : n/a
served  : n/a
in      : continuous..
the     : Continuity break over the n-1-gram: ['served', 'in']
Kremlin : Continuity break over the n-1-gram: ['in', 'the']
this    : Continuity break over the n-1-gram: ['the', 'Kremlin']
morning : continuous..
to      : continuous..
be      : continuous..
honest  : Continuity break over the n-1-gram: ['to', 'be']
and     : continuous..
morally : continuous..
proper. : continuous..

The n-1-grams above are the shared words that bridge together different sentences from the corpus:

1) He
2) served
   A02 1680  of education at Fort Hays, Kan&, State College. He has served
   B07 1660  He has served in positions of greater glamour, both at home and abroad;
3) in
   B07 1660  He has served in positions of greater glamour, both at home and abroad;
4) the
   A08 0100  1945, Pfaff served in the Army Air Corps. While in the service
5) Kremlin
   A07 1340  _MOSCOW, JUNE 18_- At a gay party in the Kremlin for President
   B20 0920  with Lenin to less distinguished quarters in the Kremlin wall is not
6) this
             Premier Khrushchev expected the millions looking toward the Kremlin
   B07 1110  this morning to be filled with admiration or rage- depending
7) morning
             Premier Khrushchev expected the millions looking toward the Kremlin
   B07 1110  this morning to be filled with admiration or rage- depending
8) to
   B07 1110  this morning to be filled with admiration or rage- depending
9) be
   B07 1110  this morning to be filled with admiration or rage- depending
10) honest
   B02 1050  ills. Both men are known to be honest and public-spirited. Mayor
11) and
   B02 1050  ills. Both men are known to be honest and public-spirited. Mayor
   C16 1520  and be honest and morally proper. All in all, Montgomery calls for a
12) morally
   C16 1520  and be honest and morally proper. All in all, Montgomery calls for a
13) proper.
   C16 1520  and be honest and morally proper. All in all, Montgomery calls for a

This is where the randomness of the semantics comes in. In sentence B07 1660, "he" refers to someone serving in the military. From A08, the trigram is plucked out of a sentence about a certain person in the Army Air Corps. But the object of the preposition gets completely changed with the A07 and B20 sentences: "in the Kremlin". The sentence jumps from the U.S. military to the Soviet political elite over the n-1-gram "in the".

What's important is that the grammatical integrity is preserved because the structure is fundamentally infix. "Served in the Kremlin" (verb, preposition, article, noun) pivots over the grammatically significant "in the". In English, the preposition-article structure is infix with respect to the verb it modifies and the noun within its prepositional phrase. This is why we speculate the generator would fail on languages without similar infix-significant constructions, such as German, where many verbs end up being postfix-significant: "Er hat in dem Kremlin gearbeitet" would be impossible to construct with n-gram methods. (And never mind the conjugation or case changes!)

We also ran statistical analysis on sparseness. Much to our non-surprise, we found that with bigger n-grams and smaller corpora, sentences tended to have higher continuity; that is, they tended to come entirely from the same original sentence. [All sentences here come from the same corpus:]

    n                                                    2      3      4      5
    Average words per sentence                          20.3   22.6   28.8   28.8
    Average continuity breaks per sentence              13.3    7.8    2.8    0.7
    Percentage of sentences with no continuity breaks
      (straight copied from the corpus)                  .2%     7%    30%    76%

Further possibilities include running n-gram analysis on other languages, on different and bigger corpora, and on a tagged corpus, with or without some sort of tag-significant word-choice algorithm, in order to analyze the sentence structures it creates. With more time, n-gram analysis could reveal more facts about language.
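The "only one choice at each probability table" effect can also be measured directly by walking the tree and gathering statistics, as suggested earlier. Below is a minimal sketch of such a procedure; it assumes the data and length globals set up in main.py (see Source Code), and count_contexts is an illustrative name rather than part of the project code (the sparseness numbers above came from do_stats in main.py).

    def count_contexts(tree, depth):
        """Walk the n-gram tree built by rsg.build_dict and count how many
        n-1-gram contexts exist, and how many allow only one continuation."""
        if depth == 0:
            # at the bottom, `tree` is an rsg.Node whose .dict maps follower word -> frequency
            if len(tree.dict) == 1:
                return 1, 1
            return 1, 0
        contexts = 0
        single = 0
        for subtree in tree.values():
            c, s = count_contexts(subtree, depth - 1)
            contexts += c
            single += s
        return contexts, single

    contexts, single = count_contexts(data, length - 1)
    print "%d distinct n-1-gram contexts; %.1f%% allow only one next word" \
          % (contexts, 100.0 * single / contexts)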
Source Code

conf.py

#This must be import'able python code.
#Each must be a list of filenames to read in.
import glob
icame = glob.glob(
    "/afs/ir/data/linguistic-data/Brown/ICAME-Brown1/brown1_[a].txt")
tagged = glob.glob(
    "/afs/ir/data/linguistic-data/Brown/Treebank3-Tagged-Brown/c[q-r]/*")
small = ["smallbrown"]

import inputclasses
FILENAMES = icame
INPUTCLASS = inputclasses.ICAMEInput
#FILENAMES = tagged
#INPUTCLASS = inputclasses.TreebankTaggedInput

inputclasses.py

from __future__ import generators
import re, string

class Metadata(object):
    def __init__(self, pos, tag):
        self.pos = pos
        self.tag = tag

EMPTY_METADATA = Metadata(('A00 0000', 42, 42), None)

class ICAMEInput(object):
    """Looks like:
    A01 0010 The Fulton County Grand Jury said Friday an investigation
    A01 0020 of Atlanta's recent primary election produced "no evidence" that
    A01 0030 any irregularities took place. The jury further said in term-end
    """
    def __init__(self, infile):
        self.infile = infile

    def __iter__(self):
        wordcount = 0
        for line in self.infile:
            posstr = " ".join(line.split()[:2])
            toks = line.split()[2:]
            for horizpos, tok in zip(range(len(toks)), toks):
                wordcount += 1
                yield tok, Metadata((posstr, horizpos, wordcount), None)
                if len(tok) > 0 and tok[-1] == ".":
                    #assume end of sentence
                    #Then begin a new sentence:
                    yield '#', EMPTY_METADATA

rsg.py

from __future__ import generators

class Node:
    """Represents a bunch of possible next-words for one word-as-a-node.
    Doesn't include the word itself; rather, it's associated with the word
    in a key:value relationship elsewhere.
    These only live as leafs at the bottom of the tree.
    """
    def __init__(self):
        self.n = 0
        self.dict = {}           #word : freq
        self.list = None
        self.metadata_dt = {}    #word : metadatas

    def add_child(self, child, metadata):
        self.n += 1
        if self.dict.has_key(child):
            self.dict[child] += 1
            self.metadata_dt[child].append(metadata)
        else:
            self.dict[child] = 1
            self.metadata_dt[child] = [metadata]

    def finalize(self):
        self.list = []
        counter = 0
        for word, freq in self.dict.iteritems():
            counter += freq
            self.list.append((counter, word))

    def get_word(self, rand):
        n = rand * self.n
        if not self.list:
            self.finalize()
        item_bottom = 0
        item_top = len(self.list) - 1
        while item_top > item_bottom + 1:
            item_middle = (item_top + item_bottom) / 2
            if n < self.list[item_middle][0]:
                item_top = item_middle
            elif n > self.list[item_middle][0]:
                item_bottom = item_middle
            else:
                return self.list[item_middle][1]
        return self.list[item_top][1]

SENTENCE_DELIM = '#'

def build_dict(n, token_source, n_tuple_dict=None):
    """Will build upon the n_tuple_dict tree, and return a new, updated tree"""
    if not n_tuple_dict:
        n_tuple_dict = {}
    starting_ngram = [SENTENCE_DELIM for i in xrange(n)]
    count = 0
    cur_ngram = starting_ngram
    for tok, metadata in token_source:
        count += 1
        cur_ngram = cur_ngram[1:] + [tok]
        curnode = n_tuple_dict
        for i in xrange(n - 1):
            #Populate the n-1-length path with {}'s and a Node
            if not curnode.has_key(cur_ngram[i]):
                if i == n - 2:
                    curnode[cur_ngram[i]] = Node()
                else:
                    curnode[cur_ngram[i]] = {}
            curnode = curnode[cur_ngram[i]]
        #curnode is now actually a node, not a dict
        curnode.add_child(cur_ngram[n - 1], metadata)
        if tok == SENTENCE_DELIM:
            cur_ngram = starting_ngram[:]
    return n_tuple_dict, count

def get_choosing_node(dict, sequence):
    """Traverses a sequence of words, length=n, down the tree.
    It returns the bottom node, which is a Node node, that contains
    a probability table."""
    working = dict
    for i in xrange(len(sequence)):
        if working.has_key(sequence[i]):
            working = working[sequence[i]]
        else:
            raise Exception, "word " + sequence[i] + " in sequence can't be found"
    return working

import random
import time
random.seed(time.clock())

MAXWORDS = 2056

def random_sentence(dict, n):
    """yields word, positionlist.
    Uses the dict as the big tree, and n as the length of the n-gram"""
    startwith = [SENTENCE_DELIM for i in xrange(n-1)]
    newword = 'guten Tag'
    wordcount = 0
    while newword != SENTENCE_DELIM:
        endnode = get_choosing_node(dict, startwith)
        newword = endnode.get_word(random.random())
        if newword is SENTENCE_DELIM:
            yield ".", [('A00 0000', 0, 0)]
        else:
            poslist = [metadata.pos for metadata in endnode.metadata_dt[newword]]
            yield newword, poslist
        startwith = startwith[1:] + [newword]
        wordcount += 1
        if wordcount > MAXWORDS:
            yield ".", [('A00 0000', 0, 0)]
            break

main.py

#!/usr/local/bin/python -i
"""Usage
    -n num        n-gram length (default 3)
"""
from cmd_line import command_line
import sys
import os
import rsg
import conf
import inputclasses
import pprint

def num_cbreaks(sent):
    """Counts the number of continuity breaks in the given sentence.
    sent: list of (word, its position list)
    """
    breaks = 0
    for i in range(length, len(sent)):
        end_of_prev_ngram = sent[i-1]
        word, posls = end_of_prev_ngram
        prev_absolute_wordpositions = [pos[2] for pos in posls]
        end_of_ngram = sent[i]
        word, posls = end_of_ngram
        cur_absolute_wordpositions = [pos[2] for pos in posls]
        for cur_absolute_wordpos in cur_absolute_wordpositions:
            if cur_absolute_wordpos - 1 in prev_absolute_wordpositions:
                break   #No continuity break!
        else:
            breaks += 1
            #print "%-25s: Continuity break -----" %word
    return breaks

def do_stats(num_sents, benchmarkincr=.05, status=1):
    """Generates a lot of sentences, and displays statistical info
    num_sents: number of sentences to run the analysis on
    benchmarkincr: for progress indicator
    status: boolean, whether or not to show the progress indicator
    """
    global length
    total_breaks = 0
    total_words = 0
    total_nobreaks = 0
    lastbenchmark = 0.0
    for i in xrange(num_sents):
        if status:
            if 1.0 * i / num_sents > lastbenchmark + benchmarkincr:
                print "%d%% done, %d sentences analyzed" %(100.0 * i / num_sents, i)
                lastbenchmark += benchmarkincr
        sent = list(rsg.random_sentence(data, length))[:-1]
        num_breaks = num_cbreaks(sent)
        if num_breaks == 0:
            total_nobreaks += 1
        total_breaks += num_breaks
        total_words += len(sent)
    avg_words_per_sent = total_words * 1.0 / num_sents
    avg_breaks_per_sent = total_breaks * 1.0 / num_sents
    breaks_per_word = total_breaks * 1.0 / total_words
    perc_total_nobreaks = total_nobreaks * 1.0 / num_sents
    print "------------------- Results -----------------------"
    allvars = locals(); allvars.update(globals())
    print """
length=%(length)s
num_sents=%(num_sents)s
perc_total_nobreaks=%(perc_total_nobreaks)s   #Straight-copied sentences; indicator of sparseness
avg_words_per_sent=%(avg_words_per_sent)s
avg_breaks_per_sent=%(avg_breaks_per_sent)s
breaks_per_word=%(breaks_per_word)s
""" % allvars

def show(linetag):
    """hard-coded to work with the ICAME-Brown1 corpus on the leland systems"""
    fgrepable_linetags = linetag
    s = os.popen("fgrep --no-filename -C 10 '%s' /afs/ir/data/linguistic-data/Brown/ICAME-Brown1/*"
                 %fgrepable_linetags).read().replace('\n', ' \n')
    s = s.replace(linetag, '\001%s\033[31m%s\033[0m\002' %(ANSIBOLD, linetag))
    return s

def clines(linetags):
    """hard-coded to work with the ICAME-Brown1 corpus on the leland
    systems"""
    fgrepable_linetags = '\n'.join(linetags)
    return os.popen("fgrep --no-filename '%s' /afs/ir/data/linguistic-data/Brown/ICAME-Brown1/*"
                    %fgrepable_linetags).read().replace('\n', ' \n')

## --------------------- Begin "main" area -----------------------------

flags = command_line('h', __doc__, [], ['n'])
flags.read()
strlen = flags.switch('n')
if strlen:
    length = int(strlen)
else:
    length = 3

# Read-in
data = {}
wordcount = 0
for i in conf.FILENAMES:
    print "Processing file", i
    infile = open(i)
    gen = conf.INPUTCLASS(infile)
    data, wordsread = rsg.build_dict(length, gen, data)
    wordcount += wordsread
print "%s total words processed" %wordcount

ANSIBOLD = os.popen("tput bold").read()
print "h for help. Control-D to exit"
while 1:
    opt = raw_input("? ")
    if opt == 'h':
        print """ [Enter]=new sentence; s=show sentence; a=all positions;
 c=context positions; b=continuity analysis"""
    elif opt == 's':
        print ' '.join([itm[0] for itm in sent])
    elif opt == 'a':
        pprint.pprint(sent)
    elif opt == 'c':
        for i, (w, posls) in zip(range(len(sent)), sent):
            print "%s) %s " %(i, w)
            linetags = [pos[0] for pos in posls]
            ctxtlines = clines(linetags)
            s = ctxtlines
            s = '\t' + s.strip().replace('\n', '\n\t') + '\n'
            s = s.replace(" %s " %w, '\001%s\033[31m %s \033[0m\002' %(ANSIBOLD, w))
            print s,
        print
    elif opt == 'b':
        words = [w for w, posls in sent]
        breaks = 0
        for i in range(0, length):
            end_of_ngram = sent[i]
            word, posls = end_of_ngram
            print "%-25s: n/a" %word
        for i in range(length, len(sent)):
            end_of_prev_ngram = sent[i-1]
            word, posls = end_of_prev_ngram
            prev_absolute_wordpositions = [pos[2] for pos in posls]
            end_of_ngram = sent[i]
            word, posls = end_of_ngram
            cur_absolute_wordpositions = [pos[2] for pos in posls]
            for cur_absolute_wordpos in cur_absolute_wordpositions:
                if cur_absolute_wordpos - 1 in prev_absolute_wordpositions:
                    print "%-25s: continuous.." %word
                    break   #No continuity break!
            else:
                print "%-25s: Continuity break over the n-1-gram: %s " \
                      %(word, words[i-(length-1):i])
    elif opt == '':
        print '-'*60
        sent = list(rsg.random_sentence(data, length))[:-1]
        print ' '.join([itm[0] for itm in sent])
    else:
        #read-eval-print loop by default
        print eval(opt)
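For reference, one way to run the program and gather the statistics above (an assumed invocation based on main.py's usage string and its eval() fallthrough, not a transcript from the report):

    % python -i main.py -n 4
    ...
    ?                      <- press Enter to generate a new random sentence
    ? b                    <- continuity analysis of that sentence
    ? do_stats(500)        <- eval()'d by the default branch; prints the sparseness statistics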