Introduction to Bioinformatics
DNA sequence annotation - Final Project

Due Day: Friday, March 28, in-class presentations of the project results

Instructions:

You will work on the project in class.

In this project you will work with real data. You will get the DNA sequence by e-mail. The sequence is saved in a text file as one long string of characters without spaces or new lines.

Please read the full project description before you start to write your programs. You don't need to break the project into steps; the steps are given for your convenience, and you may combine them if you like.

Read the grading rubrics carefully BEFORE you start to work on the project.

Project description:

Step 1: As a first step, you will find the potential genes that are present in the input DNA sequence. You need to find genes on the original sequence that was sent to you by e-mail, which we will call the MAIN sequence, and also on the complementary string, which we will call the COMPLEMENT sequence.

Instructions on how to find a gene:

Rule 1: This rule is for the MAIN sequence. If the gene is on the GIVEN strand of DNA, it starts with one of the codons TTA, CTA or TCA as a start codon and ends with the codon CAT as an end codon. When calculating the length of the potential gene in this case, don't count the start codon, but do count the end codon.

Rule 2: This rule is for the COMPLEMENT sequence. If a gene is on the complementary strand, it starts with the codon TAC and ends with one of the following codons: ATT, ATC, ACT. When calculating the length of the potential gene in this case, count the start codon, but don't count the end codon. See the programming requirements for this part.

Rule 3: An additional restriction is that the length of the gene must be divisible by 3.
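The three rules above can be sketched in Python as follows. This is only a minimal illustration, not the required solution. The names `find_genes`, `complement` and the codon constants are hypothetical. The sketch also makes two assumptions that the rules do not state: a gene is closed by the first end codon whose distance from the start codon is a multiple of 3, and every start codon is reported, even when candidates overlap. Under both Rule 1 (skip the start codon, count the end codon) and Rule 2 (count the start codon, skip the end codon), the length works out to the distance between the first base of the start codon and the first base of the end codon.

```python
# Codon sets taken from Rule 1 (MAIN) and Rule 2 (COMPLEMENT).
MAIN_STARTS = ("TTA", "CTA", "TCA")
MAIN_ENDS = ("CAT",)
COMP_STARTS = ("TAC",)
COMP_ENDS = ("ATT", "ATC", "ACT")


def complement(seq):
    """Return the base-by-base complement (A<->T, C<->G), not reversed."""
    return seq.translate(str.maketrans("ATCG", "TAGC"))


def find_genes(seq, starts, ends):
    """Return a list of (start_position, length) pairs, positions 0-based.

    Assumption: the first end codon found in steps of 3 after the start
    codon closes the gene, so the length is always divisible by 3 (Rule 3).
    """
    genes = []
    for i in range(len(seq) - 5):          # room for a start and an end codon
        if seq[i:i + 3] not in starts:
            continue
        # Step codon by codon so that (j - i) stays a multiple of 3.
        for j in range(i + 3, len(seq) - 2, 3):
            if seq[j:j + 3] in ends:
                # Rule 1: skip start, count end -> (j + 3) - (i + 3) = j - i
                # Rule 2: count start, skip end ->  j - i
                genes.append((i, j - i))
                break
    return genes
```

For the MAIN sequence one would call `find_genes(main, MAIN_STARTS, MAIN_ENDS)`, and for the COMPLEMENT sequence `find_genes(complement(main), COMP_STARTS, COMP_ENDS)`. Trying it on a few short hand-made strings first makes it easy to check the length rule by eye.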

Programming Requirements:

Before you start to work with the real DNA sequence that is given to you, first test your program on a short input sequence to make sure that the program outputs a correct result.

To find a gene on the complementary strand of DNA, you need to complement the input sequence and apply Rule 2. The rule for finding a DNA complement is: A → T, T → A, C → G and G → C. For example, if the input DNA sequence is TTACCGTCAT, the complement is AATGGCAGTA. You then apply Rule 2 to the resulting complement sequence.

Input/Output: In this project you will work with large input data, so you need to know how to read input from a file. You also need to write the program results to a file. You can learn by yourself how to use file input/output in Python and get a 10-point bonus, or you can use UNIX input/output redirection (see the handout with the UNIX input/output redirection example that is attached to the project description).

The output for this step should include the following:

The list of the potential genes (length larger than 300) in increasing order of length, in the main string and in the complement string. In step 3 you will decide whether each potential gene is a real gene.

The list of the genes that have a length of less than 300 (probable pseudogenes) in increasing order of length, in the main string and in the complement string.

The starting position of each potential gene and pseudogene in the input sequence. Assume that the first position of the input sequence is 0 (as usual for Python sequence data types).

For your convenience, it would help to summarize the results of this step in four tables: two tables for potential genes (main string and complement string) and two tables for pseudogenes (you can create a separate Word file for the summary):

Tables 1 and 2: Potential genes. Table 1 for Main String.
Table 2 for Complement Reverse String. Length ≥ 300. List genes in increasing order of length.

Gene number | Start position | Length
You can assign numbers to the potential genes that you found. You don't need to copy the actual sub-sequence. | Relative to the beginning of the input sequence |

Tables 3 and 4: Pseudogenes. Table 3 for Main String. Table 4 for Complement Reverse String. Length ≤ 300. List genes in increasing order of length.

Gene number | Start position | Length
You can assign numbers to the potential genes that you found. You don't need to copy the actual sub-sequence. | Relative to the beginning of the input sequence |

Step 2: In this step you will BLAST the potentially real genes that you found in step 1. (In this step you can use different web resources; for example, you can use Biology Workbench instead of NCBI.)

Go to the NCBI home page: http://www.ncbi.nlm.nih.gov/

Choose BLAST.

Choose BLASTX (under Translated: Translated query vs. protein database).

The output for this step should include the following:

Save the results of the searches for the final summary. Different genes may produce the same results, since you are searching a protein database.

For your convenience, it would help to summarize the results of this step in tables (you can add an additional column to Tables 1 and 2 that you created in step 1):

Tables 1 and 2: Potential genes. Table 1 for Main String. Table 2 for Complement Reverse String. Length ≥ 300. List genes in increasing order of length.

Gene number | Start position | Length | BLAST results
You can assign numbers to the potential genes that you found. You don't need to copy the actual sub-sequence. | Relative to the beginning of the input sequence | |

Step 3: In this step you will locate a potential promoter in the given DNA sequence for each potential gene that you found in step 1 and find the strength of the promoter.
A promoter is a region of DNA near the beginning of a gene that controls if and when the gene is actually expressed.

How to find a promoter and its strength:

1. For each potential gene on the COMPLEMENT string that you found in step 1, find the sub-sequence located between positions n-14 and n-6, including the nucleotides at positions n-14 and n-6, where n is the start position of the potential gene in the sequence. Note that n must be at least 14 for the gene to have a promoter: a potential gene that starts at a position between 0 and 13 does not have a promoter. The length of the promoter string is 9.

2. Find the alignment score between the promoter you found and the promoter consensus sequence TG_TATAAT, where the underscore can be any base. Use the following scoring rule: match = 1 and mismatch = 0. The underscore position is always considered a match. The alignment score should be calculated as a percentage: (score/9)*100. In general, a higher alignment score means a better promoter and means that the sequence under study is more likely to be a real gene.

3. Reverse the MAIN input string and repeat items 1 and 2 for the reversed sequence.

Final Output for the Project:

Summarize the results of all steps in the following tables.

Tables 1 and 2: Potential genes. Table 1 for Main String. Table 2 for Complement Reverse String. Length ≥ 300. LIST GENES IN INCREASING ORDER OF THE STRENGTH OF THE PROMOTER.

Gene number | Start position | Promoter Score | Length | BLAST Results | Summary and Conclusions
You can assign numbers to the potential genes that you found. You don't need to copy the actual sub-sequence. | Relative to the beginning of the input sequence | | | |

The summary and conclusions part should state whether the potential gene could be a real gene, based on the strength of the promoter and also on the BLAST results. Your summary should include the findings for the main input sequence and for the complement reversed sequence as well.

Draw a diagram of the input DNA sequence showing the positions of potential genes, real genes and pseudogenes.
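
The promoter scoring rule from Step 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the required solution. The function name `promoter_score` and the choice to return `None` for a gene without a promoter are hypothetical. The sketch only covers slicing out positions n-14 to n-6 and scoring them against the consensus TG_TATAAT. It does not cover reversing the MAIN string.

```python
CONSENSUS = "TG_TATAAT"  # underscore = any base, always scored as a match


def promoter_score(seq, n):
    """Return the promoter alignment score (in percent) for a gene at n.

    seq is the sequence the gene was found on and n is the 0-based start
    position of the gene. Returns None when n is between 0 and 13, because
    such a gene has no promoter (assumed convention for this sketch).
    """
    if n < 14:
        return None
    # Positions n-14 .. n-6 inclusive: a slice of exactly 9 characters.
    promoter = seq[n - 14:n - 5]
    matches = sum(
        1 for base, ref in zip(promoter, CONSENSUS)
        if ref == "_" or base == ref       # match = 1, mismatch = 0
    )
    return matches / len(CONSENSUS) * 100
```

Sorting the potential genes by this value, for example with `sorted(genes, key=...)`, gives the order requested for the final tables. Genes for which the function returns `None` have to be handled separately.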