CS 6903/4903 Big Data Analysis for Information Security




COURSE DESCRIPTION

This project-oriented, hands-on course focuses on big data analysis for information security in distributed and parallel environments. Students will explore key concepts of data analysis and of distributed and parallel processing system architecture, applied to massive network traffic datasets, to support real-time decision making about security threats in distributed environments. The course is designed as a hands-on learning experience in which students learn by doing; with concrete, practical experience, students will be better prepared to apply their new knowledge to real-life, data-intensive research situations. Students are expected to make use of the MapReduce parallel computing paradigm and associated technologies, such as distributed file systems, MongoDB databases, and stream computing engines, to design a scalable big data analysis system. Students should be able to apply machine learning (supervised or unsupervised), information extraction, feature selection, and stream mining in information security domains.

Course objectives

By the end of this course, students will be familiar with the fundamental concepts of big data analysis; will understand distributed and parallel computing and its application to big data analysis; will be competent both in recognizing the challenges faced by security applications that deal with huge amounts of data and in proposing scalable solutions to them; and will understand how big data impacts business intelligence, scientific discovery, and our day-to-day lives.

Challenges

Students will explore, design, and implement solutions to several real-world big data analysis challenge problems.

Projects

Students are free to explore a problem of their own interest in big data analysis for security and to propose their own solutions, with the consent of the instructors. We will hold weekly meetings to discuss every proposed project. Projects will be completed individually.

Labs

There are individual labs to get you started, organized in two parts.

Part I Fundamentals of Big Data Analysis Environment:

Distributed and Parallel System Architecture and Configuration

In this lab, students will set up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System (HDFS), running on Ubuntu Linux. After a successful single-node installation, students will configure a three-node Hadoop cluster (one master and two slaves). This lab requires extensive system configuration knowledge and practice.
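A pseudo-distributed setup is driven by a handful of XML configuration files. As an illustrative fragment only (the localhost host name and port 9000 follow common Hadoop defaults and may differ per installation):

```xml
<!-- etc/hadoop/core-site.xml: point the default filesystem at the local NameNode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml: a single node can hold only one replica of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

Moving to the three-node cluster mostly means repeating this configuration on each machine, pointing fs.defaultFS at the master and raising the replication factor.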

MapReduce on Word Counting

Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input dataset into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. To get familiar with the MapReduce platform in Hadoop, a word counting program is used to convey the fundamental concept. Students will develop a working MapReduce application for word counting on their own Hadoop cluster.
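The map/shuffle/reduce flow described above can be sketched without a cluster. Below is a minimal pure-Python emulation of the three phases for word counting; it is an illustration of the paradigm, not the Hadoop API, which the real lab would use via Mapper and Reducer classes:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in one input line.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum all partial counts for a single word.
    return word, sum(counts)

def run_wordcount(lines):
    # Shuffle phase: group the mapper output by key, as the framework would
    # do between the map and reduce tasks.
    groups = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            groups[word].append(one)
    return dict(reducer(word, counts) for word, counts in groups.items())
```

On a real cluster the mapper runs on each input split in parallel and the framework performs the sort/shuffle; this sketch only serializes that same logic on one machine.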

NoSQL

The relational schema format is no longer suitable for storing the huge data volumes involved in big data analysis. Instead, NoSQL databases such as MongoDB and HBase are widely used as I/O storage for Hadoop to deliver complex analytics and data processing. Students can use such a database to pull their data into Hadoop MapReduce jobs, process the data, and return the results to a NoSQL database collection. Students will gain hands-on experience in converting unstructured data into NoSQL data and in performing the necessary operations, such as NoSQL queries, through the API.
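As a sketch of the unstructured-to-NoSQL conversion step, the function below parses one raw web-server access-log line into a structured document ready for insertion into a MongoDB collection. The log format, field names, and the database/collection names in the comments are assumptions for illustration; the commented pymongo calls assume a locally running MongoDB:

```python
import re
from datetime import datetime

# Common access-log shape: IP, identity, user, [timestamp], "METHOD path proto", status
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+)'
)

def log_line_to_document(line):
    # Convert one unstructured log line into a structured document (a dict);
    # returns None for lines that do not match the expected format.
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return {
        "ip": m.group("ip"),
        "timestamp": datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z"),
        "method": m.group("method"),
        "path": m.group("path"),
        "status": int(m.group("status")),
    }

# With a running MongoDB instance, the documents would then be inserted and
# queried through the API, e.g.:
#   from pymongo import MongoClient
#   coll = MongoClient()["security"]["access_logs"]
#   coll.insert_one(doc)
#   coll.find({"status": {"$gte": 500}})
```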

Machine Learning and Reasoning with Mahout

Mahout mainly supports three use cases: recommendation mining, which takes users' behavior and from that tries to find items users might like; clustering, which takes documents and groups them into sets of topically related documents; and classification, which learns from existing categorized documents what documents of a specific category look like and can assign unlabelled documents to the correct category. Students will learn how to use the Mahout machine learning library to facilitate knowledge building in big data analysis.
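Mahout itself is a Java library run on Hadoop; as a language-agnostic illustration of the clustering use case, here is a minimal k-means over plain numeric vectors (this is the underlying algorithm idea, not Mahout's API, and all names here are illustrative):

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    # Minimal k-means: repeatedly assign each point to its nearest centroid,
    # then recompute each centroid as the mean of its assigned points.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick k initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assignment step: index of the closest centroid (squared distance).
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:
                # Update step: centroid = component-wise mean of the cluster.
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centroids, clusters
```

In Mahout the same assign/update iterations are expressed as MapReduce jobs, so document vectors far too large for one machine can be clustered across the cluster's nodes.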

Part II Big data analysis for security

Denial of Service Attack Analysis

Distributed Denial of Service (DDoS) attacks are among the most common attempts in security hacking to make computational resources unavailable or to impair geographical networks. Hadoop and MapReduce can step in to analyze such attack patterns in network usage: while MapReduce detects packet traffic anomalies, the scalable Hadoop architecture offers solutions for processing the data in a reasonable response time. In this lab, we will develop two algorithms: a counter-based method and an access-pattern-based method. The counter-based method relies on three key parameters: the time interval, which is the duration during which packets are analyzed; the threshold, which indicates the frequency of requests; and the unbalance ratio, which denotes the anomalous ratio of responses per page requested between a specific client and server. The access-pattern-based method relies on a pattern that differentiates normal traffic from DDoS traffic.
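A minimal single-machine sketch of the counter-based idea is shown below: a source is flagged when it exceeds the request threshold within one time interval. The packet representation and names are assumptions; in the lab the per-window counting would run as a MapReduce job over traffic captures, and the unbalance ratio would add a second signal, omitted here for brevity:

```python
def counter_based_detect(packets, interval, threshold):
    # packets: iterable of (timestamp_seconds, src_ip) request records.
    # Flag a source as suspicious if it issues more than `threshold`
    # requests within any single time window of length `interval`.
    suspicious = set()
    window_counts = {}  # (window_index, src_ip) -> request count
    for ts, src in packets:
        key = (int(ts // interval), src)
        window_counts[key] = window_counts.get(key, 0) + 1
        if window_counts[key] > threshold:
            suspicious.add(src)
    return suspicious
```

The per-key counting is exactly the shape MapReduce handles well: the mapper emits ((window, src), 1) pairs and the reducer sums them, so the same logic scales across the cluster.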

Tutorials and Hands-on Practice labs are available at:



References:

Recommended reading materials:









Email scam





Security analysis









Reference:







Big data sets:










