CS 6903 Big Data Analysis

COURSE DESCRIPTION
This project-oriented, hands-on course focuses on big data analysis for information security in distributed and parallel environments. Students will explore key concepts of data analysis and of distributed and parallel processing system architecture, applied to massive network traffic datasets, to support real-time decision making about security threats in distributed environments. The course is designed as a hands-on learning experience, on the premise that students learn better by doing. With concrete, practical experience, students will be better prepared to apply their new knowledge to real-life, data-intensive research situations. Students are expected to make use of the MapReduce parallel computing paradigm and associated technologies, such as distributed file systems, MongoDB databases, and stream computing engines, to design scalable big data analysis systems. Students should be able to apply machine learning (supervised or unsupervised), information extraction, feature selection, and stream mining in information security domains.

Course objectives
At the end of this course, the student will become familiar with the fundamental concepts of Big Data analysis; will understand distributed and parallel computing and its application to big data analysis; will become competent in recognizing the challenges faced by security applications that deal with huge amounts of data, as well as in proposing scalable solutions for them; and will be able to understand how Big Data impacts business intelligence, scientific discovery, and our day-to-day life.

Challenges
Students will explore, design, and implement solutions to several real-world big data analysis challenge problems.

Projects
Students are free to explore a problem of their own interest in big data analysis for security and to propose their own solutions, with the consent of the instructors.
We will have weekly meetings to discuss every proposed project. Projects will be completed individually.

Labs
There are six individual labs, in two parts, to get you started.

Part I Fundamentals of Big Data Analysis

Environment: Distributed and Parallel System Architecture and Configuration
In this lab, students will set up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System, running on Ubuntu Linux. After a successful installation on one node, students will configure a 3-node Hadoop cluster (one master and two slaves). This lab requires extensive system configuration knowledge and practice.

MapReduce on Word Counting
Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. To get familiar with the MapReduce platform in Hadoop, a word-counting program is used to convey the fundamental concepts. Students will develop a working MapReduce application for word counting on a Hadoop cluster.

MapReduce on Word Counting
Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner.
The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. To get familiar with the MapReduce platform in Hadoop, a word-counting program is used to convey the fundamental concepts. Students will develop a working MapReduce application for word counting on their own Hadoop cluster.

NoSQL
The relational schema format is no longer suitable for storing the huge data volumes involved in big data analysis. Instead, NoSQL stores such as MongoDB and HBase are widely used as I/O storage for Hadoop to deliver complex analytics and data processing. Students can use them to pull data into Hadoop MapReduce jobs, process the data, and return the results to a NoSQL database collection. Students will gain hands-on experience in converting unstructured data into NoSQL data and in performing all necessary operations, such as NoSQL queries through the API.

Machine Learning and Reasoning with Mahout
Mahout mainly supports three use cases. Recommendation mining takes users' behavior and from it tries to find items users might like. Clustering takes documents and groups them into sets of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabelled documents to the correct category. Students will learn how to use the Mahout machine learning library to build up knowledge in big data analysis.

Part II Big Data Analysis for Security

Denial of Service Attack Analysis
Distributed Denial of Service (DDoS) attacks are one of the common attempts in security hacking to make computing resources unavailable or to impair geographical networks. To analyze such attack patterns in network usage, Hadoop and MapReduce can step in.
While MapReduce detects packet traffic anomalies, the scalable Hadoop architecture offers solutions for processing the data within a reasonable response time. In this lab, we will develop two algorithms: a counter-based method and an access-pattern-based method. The counter-based method relies on three key parameters: the time interval, which is the duration during which packets are analyzed; the threshold, which indicates the frequency of requests; and the unbalance ratio, which denotes the anomaly ratio of responses per page requested between a specific client and server. The access-pattern-based method relies on a pattern that differentiates normal traffic from DDoS traffic.

Tutorials and hands-on practice labs are available at:
https://sites.google.com/site/bigdatansalabware/

References:

Recommended reading materials:
http://www-01.ibm.com/software/data/bigdata/use-cases/security-intelligence.html
http://blogs.cisco.com/security/big-data-in-security-part-i-trac-tools/
http://blogs.cisco.com/security/big-data-in-security-part-ii-the-amplab-stack/
http://blogs.cisco.com/security/big-data-in-security-part-v-anti-phishing-in-the-cloud/

Email scam
http://blogs.igalia.com/dpino/2012/08/07/metamail-email-analysis-with-hadoop/
http://blogs.cisco.com/security/big-data-in-security-part-iv-email-auto-rule-scoring-on-hadoop/

Security analysis
http://healthitsecurity.com/2013/10/11/csa-report-big-data-analytics-can-improve-it-security/
http://www.cisco.com/web/ME/connect2014/saudiarabia/pdf/ahmed_fakahany_ibm_sbm_big_data_internet_of_things.pdf
http://bigdatablog.emc.com/2013/01/30/rsa-security-analytics/
http://cybersecurity.mit.edu/2013/11/mobile-malware-analysis-in-hadoop/
http://blogs.cisco.com/security/threat-detection-a-big-data-approach-to-security/

Reasoning Reference:
http://machinelearningbigdata.blogspot.com/
http://www.bigdatatraining.in/machine-learning-training/
http://www.slideshare.net/Cataldo/apache-mahout-tutorial-recommendation-20132014

Big data sets:
http://www.kdnuggets.com/2011/02/free-public-datasets.html
http://aws.amazon.com/publicdatasets/
http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
http://stackoverflow.com/questions/2674421/free-large-datasets-to-experiment-with-hadoop
http://www.ll.mit.edu/mission/communications/cyber/CSTcorpora/ideval/data/
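
Illustration for the word-counting lab
The map, sort, and reduce flow described for the MapReduce word-counting lab can be pictured with a small sketch that runs in a single Python process. It does not use Hadoop or any Hadoop API. The function names (map_words, shuffle, reduce_counts, word_count) are hypothetical and chosen only for illustration, and splitting words on whitespace is an assumption about how words are tokenized.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (word, 1) for every whitespace-separated token."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, standing in for the framework's sort/shuffle step."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce phase: sum all the 1s emitted for a single word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(pairs)
    # Keys are visited in sorted order, mirroring the sorted reducer input.
    return dict(reduce_counts(w, c) for w, c in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["big data", "Big security data"]))
```

In a real Hadoop job the map and reduce steps would run as separate tasks across the cluster, with the input and output stored in the distributed file system; here everything is kept in memory so the data flow is easy to follow.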
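
Illustration for the Denial of Service lab
The counter-based method named in the Denial of Service lab is described only by its three parameters: time interval, threshold, and unbalance ratio. The sketch below is one possible reading of that description, not the lab's actual algorithm. It assumes a hypothetical Packet record, counts requests and responses per client and server pair inside one time interval, and flags a pair when its request count exceeds the threshold while its responses-per-request ratio falls below the unbalance ratio. The record layout, the function name, and the exact flagging rule are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical, simplified traffic record (an assumption for this sketch)."""
    timestamp: float   # seconds
    client: str
    server: str
    is_response: bool  # False: request from client; True: response from server


def detect_counter_based(
    packets: Iterable[Packet],
    start: float,
    interval: float,
    threshold: int,
    unbalance_ratio: float,
) -> List[Tuple[str, str]]:
    """Return the (client, server) pairs that look anomalous in one interval."""
    requests: Dict[Tuple[str, str], int] = defaultdict(int)
    responses: Dict[Tuple[str, str], int] = defaultdict(int)
    for p in packets:
        # Only packets inside [start, start + interval) are analyzed.
        if not (start <= p.timestamp < start + interval):
            continue
        key = (p.client, p.server)
        if p.is_response:
            responses[key] += 1
        else:
            requests[key] += 1

    suspects = []
    for key, n_req in requests.items():
        ratio = responses[key] / n_req  # n_req >= 1 for every key in requests
        # Assumed rule: many requests and few responses per request.
        if n_req > threshold and ratio < unbalance_ratio:
            suspects.append(key)
    return sorted(suspects)


if __name__ == "__main__":
    traffic = [Packet(float(i), "10.0.0.1", "web", False) for i in range(10)]
    traffic.append(Packet(0.5, "10.0.0.1", "web", True))
    print(detect_counter_based(traffic, 0.0, 60.0, 5, 0.5))
```

The per-pair counting is the kind of step that maps naturally onto MapReduce: a map task could emit one record per packet keyed by the client and server pair, and a reduce task could sum the counts and apply the threshold test.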