SD Module- Python

Assignment No. 3

Title:

Write Python code that loads any data set (for example, game_medal.csv) and plots a graph.

Objectives:

Understand the basics of data preprocessing, and learn the basic plot functions of Pandas, Matplotlib, and Seaborn.

Problem Definition:

Develop Python code that loads any data set (for example, game_medal.csv) and plots a graph.

Outcomes:

1. Students will be able to demonstrate Python data preprocessing.

2. Students will be able to plot a graph in Python using the Pandas plot function.

3. Students will be able to demonstrate the matplotlib and seaborn packages.

Hardware Requirements: Any CPU with a Pentium processor or similar, 256 MB RAM or more, 1 GB hard disk or more


Software Requirements: 32/64 bit Linux/Windows Operating System, Python 3 with the pandas, matplotlib, and seaborn packages


Theory:

Preprocessing

Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares raw data for further processing.

Why preprocessing?

Real-world data are generally:

Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

Noisy: containing errors or outliers

Inconsistent: containing discrepancies in codes or names

Tasks in data preprocessing:

• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.

• Data integration: using multiple databases, data cubes, or files.

• Data transformation: normalization and aggregation.

• Data reduction: reducing the volume but producing the same or similar analytical results.

• Data discretization: part of data reduction, replacing numerical attributes with nominal ones.
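The tasks above can be sketched with pandas on a tiny table. This is only an illustration: the DataFrame, its column names, and the bin boundaries are made up for the example and are not part of the assignment data.

```python
import pandas as pd

# Hypothetical toy data: inconsistent codes, a duplicate row, and a missing value.
raw = pd.DataFrame({
    "country": ["USA", "usa", "CHN", "CHN", "FRA"],
    "medals": [10.0, 10.0, None, 8.0, 4.0],
})

# Data cleaning: resolve inconsistent codes, drop duplicates, fill missing values.
clean = raw.assign(country=raw["country"].str.upper())
clean = clean.drop_duplicates().reset_index(drop=True)
clean["medals"] = clean["medals"].fillna(clean["medals"].mean())

# Data transformation: min-max normalization to the range [0, 1].
lo, hi = clean["medals"].min(), clean["medals"].max()
clean["medals_scaled"] = (clean["medals"] - lo) / (hi - lo)

# Data discretization: replace the numeric attribute with nominal levels
# (the bin boundaries are arbitrary choices for this sketch).
clean["level"] = pd.cut(clean["medals"], bins=[0, 5, 8, 100],
                        labels=["low", "medium", "high"])

# Data reduction by aggregation: one row per country.
per_country = clean.groupby("country")["medals"].sum()

print(clean)
print(per_country)
```

Filling with the column mean is just one possible cleaning strategy; dropping the incomplete rows with dropna() is another common choice.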

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

1. Plotting categorical scatter plots with Seaborn

# Plotting categorical scatter plots with Seaborn

# importing the required modules
import matplotlib.pyplot as plt
import seaborn as sns

# x axis values
x = ['sun', 'mon', 'fri', 'sat', 'tue', 'wed', 'thu']

# y axis values
y = [5, 6.7, 4, 6, 2, 4.9, 1.8]

# plotting strip plot with seaborn
# (x and y are passed by keyword; recent seaborn releases do not accept them positionally)
ax = sns.stripplot(x=x, y=y)

# giving labels to x-axis and y-axis
ax.set(xlabel='Days', ylabel='Amount_spend')

# giving title to the plot
plt.title('My first graph')

# function to show plot
plt.show()

[pic]

Explanation: This is one kind of scatter plot for categorical data, drawn with the help of seaborn.

• Categorical data is represented on the x-axis and the corresponding values are represented on the y-axis.

• The .stripplot() function defines the type of the plot and draws it on the canvas.

• The .set() function is used to set the labels of the x-axis and y-axis.

• The .title() function is used to give a title to the graph.

• To view the plot, we use the .show() function.

2. Stripplot using inbuilt data-set given in seaborn :

# importing the required module

import matplotlib.pyplot as plt

import seaborn as sns

  

# use to set style of background of plot

sns.set(style ="whitegrid")

  

# loading data-set

iris = sns.load_dataset('iris');

  

# plotting strip plot with seaborn

# deciding the attributes of dataset on which plot should be made

ax = sns.stripplot(x = 'species', y = 'sepal_length', data = iris);

  

# giving title to the plot

plt.title('Graph')

  

# function to show plot

plt.show()

[pic]

Explanation:

• iris is one of the example datasets that seaborn can load by name.

• We use the .load_dataset() function to load it. This function fetches seaborn's named example datasets; to load a file of our own, we read it with pandas, for example with pd.read_csv() and the path and name of the file.

• The .set(style="whitegrid") function is used here to define the background of the plot. We can use "darkgrid" instead of "whitegrid" if we want a dark-colored background.

• In the .stripplot() function we define which attribute of the dataset goes on the x-axis and which goes on the y-axis. data = iris means that these attributes are taken from the given dataset.

• We can also draw this plot with matplotlib alone, but its defaults require more work. Seaborn works well with DataFrames because, for example, column names are propagated to the plot automatically: in the figure above, the column name species appears on the x-axis and sepal_length on the y-axis. With plain matplotlib we have to set the x-axis and y-axis labels explicitly.
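As a minimal sketch of plotting our own file rather than a built-in dataset, the example below reads CSV data with pandas and passes the DataFrame to seaborn. The CSV text is a made-up stand-in for a file on disk (with a real file we would pass its path to pd.read_csv()), and the Agg backend is selected only so that the sketch also runs on a machine without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical CSV content standing in for a file such as "my_data.csv".
csv_text = """species,sepal_length
setosa,5.1
setosa,4.9
versicolor,7.0
versicolor,6.4
virginica,6.3
virginica,5.8
"""
df = pd.read_csv(io.StringIO(csv_text))

# The column names are picked up as axis labels automatically.
ax = sns.stripplot(x="species", y="sepal_length", data=df)
ax.set_title("Strip plot from CSV data")

print(ax.get_xlabel(), ax.get_ylabel())
```

In an interactive session, plt.show() would display the figure as in the earlier examples.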

3. Swarmplot using inbuilt data-set given in seaborn :

# importing the required module

import matplotlib.pyplot as plt

import seaborn as sns

  

# use to set style of background of plot

sns.set(style ="whitegrid")

  

# loading data-set

iris = sns.load_dataset('iris');

  

# plotting swarm plot with seaborn

# deciding the attributes of dataset on which plot should be made

ax = sns.swarmplot(x = 'species', y = 'sepal_length', data = iris);

  

# giving title to the plot

plt.title('Graph')

  

# function to show plot

plt.show()

Explanation:

This is very similar to the strip plot; the only difference is that it does not allow markers to overlap. The points are shifted along the categorical axis so that each one stays visible, and the graph can be read without information loss, as seen in the plot.

• We use the .swarmplot() function to draw a swarm plot.

• Another difference between Seaborn and Matplotlib is that working with DataFrames does not go quite as smoothly with Matplotlib, which can be annoying when doing exploratory analysis with Pandas. Seaborn handles this easily: its plotting functions operate on DataFrames and arrays that contain a whole dataset.

[pic]

4. If we want, we can also swap which variable is shown on each axis. For example:

# importing the required module

import matplotlib.pyplot as plt

import seaborn as sns

  

# use to set style of background of plot

sns.set(style ="whitegrid")

  

# loading data-set

iris = sns.load_dataset('iris');

  

# plotting swarm plot with seaborn

# deciding the attributes of dataset on which plot should be made

ax = sns.swarmplot(x = 'sepal_length', y = 'species', data = iris);

  

  

# giving title to the plot

plt.title('Graph')

  

# function to show plot

plt.show()

Explanation: The same can be done with a strip plot. Finally, we can say that Seaborn is built on top of matplotlib and tries to make a well-defined set of hard things easy.

[pic]

Matplotlib is arguably the most popular graphing and data visualization library for Python.

1. Plotting a line

# importing the required module

import matplotlib.pyplot as plt

  

# x axis values

x = [1,2,3]

# corresponding y axis values

y = [2,4,1]

  

# plotting the points 

plt.plot(x, y)

  

# naming the x axis

plt.xlabel('x - axis')

# naming the y axis

plt.ylabel('y - axis')

  

# giving a title to my graph

plt.title('My first graph!')

  

# function to show the plot

plt.show()

Following steps were followed:

• Define the x-axis and corresponding y-axis values as lists.

• Plot them on canvas using .plot() function.

• Give a name to x-axis and y-axis using .xlabel() and .ylabel() functions.

• Give a title to your plot using .title() function.

• Finally, to view your plot, we use .show() function.

[pic]

2. Plotting two or more lines on same plot

import matplotlib.pyplot as plt

  

# line 1 points

x1 = [1,2,3]

y1 = [2,4,1]

# plotting the line 1 points 

plt.plot(x1, y1, label = "line 1")

  

# line 2 points

x2 = [1,2,3]

y2 = [4,1,3]

# plotting the line 2 points 

plt.plot(x2, y2, label = "line 2")

  

# naming the x axis

plt.xlabel('x - axis')

# naming the y axis

plt.ylabel('y - axis')

# giving a title to my graph

plt.title('Two lines on same graph!')

  

# show a legend on the plot

plt.legend()

  

# function to show the plot

plt.show()

• Here, we plot two lines on the same graph. We differentiate between them by giving each a name (label), which is passed as an argument to the .plot() function.

• The small rectangular box giving information about the type of line and its color is called the legend. We can add a legend to our plot using the .legend() function.

[pic]

3. Customization of Plots

import matplotlib.pyplot as plt

  

# x axis values

x = [1,2,3,4,5,6]

# corresponding y axis values

y = [2,4,1,5,2,6]

  

# plotting the points 

plt.plot(x, y, color='green', linestyle='dashed', linewidth = 3,

         marker='o', markerfacecolor='blue', markersize=12)

  

# setting x and y axis range

plt.ylim(1,8)

plt.xlim(1,8)

  

# naming the x axis

plt.xlabel('x - axis')

# naming the y axis

plt.ylabel('y - axis')

  

# giving a title to my graph

plt.title('Some cool customizations!')

  

# function to show the plot

plt.show()

Here, we have made several customizations:

• setting the line-width, line-style, line-color.

• setting the marker, marker’s face color, marker’s size.

• overriding the x and y axis range. If overriding is not done, pyplot module uses auto-scale feature to set the axis range and scale.

[pic]

4. Bar Chart-

import matplotlib.pyplot as plt

  

# x-coordinates of the bars (bar centers, by default)

left = [1, 2, 3, 4, 5]

  

# heights of bars

height = [10, 24, 36, 40, 5]

  

# labels for bars

tick_label = ['one', 'two', 'three', 'four', 'five']

  

# plotting a bar chart

plt.bar(left, height, tick_label = tick_label,

        width = 0.8, color = ['red', 'green'])

  

# naming the x-axis

plt.xlabel('x - axis')

# naming the y-axis

plt.ylabel('y - axis')

# plot title

plt.title('My bar chart!')

  

# function to show the plot

plt.show()

• Here, we use the plt.bar() function to plot a bar chart.

• The x-coordinates of the bars are passed along with the heights of the bars. By default, each bar is centered on its x-coordinate.

• You can also give names to the x-axis coordinates by passing tick_label.

[pic]

5. Histogram

import matplotlib.pyplot as plt

  

# observed ages (hist() counts the frequencies itself)
ages = [2,5,70,40,30,45,50,45,43,40,44,
        60,7,13,57,18,90,77,32,21,20,40]

# setting the range and no. of intervals
# (named value_range so that the built-in range() is not shadowed)
value_range = (0, 100)
bins = 10

# plotting a histogram
plt.hist(ages, bins=bins, range=value_range, color='green',
         histtype='bar', rwidth=0.8)

  

# x-axis label

plt.xlabel('age')

# frequency label

plt.ylabel('No. of people')

# plot title

plt.title('My histogram')

  

# function to show the plot

plt.show()

• Here, we use plt.hist() function to plot a histogram.

• The raw observations are passed as the ages list; plt.hist() counts the frequencies itself.

• The range is set by a tuple containing the minimum and maximum value.

• The next step is to "bin" the range of values, that is, to divide the entire range into a series of intervals and then count how many values fall into each interval. Here we have defined bins = 10, so the range 0 to 100 is split into 10 intervals, each of width 100/10 = 10.
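To make the binning step concrete, the sketch below computes the same counts without drawing anything. It uses np.histogram from NumPy, which is not used in the listing above but performs the same binning for a given number of bins and range.

```python
import numpy as np

ages = [2, 5, 70, 40, 30, 45, 50, 45, 43, 40, 44,
        60, 7, 13, 57, 18, 90, 77, 32, 21, 20, 40]

# Split the range 0..100 into 10 equal-width intervals and count the ages in each.
counts, edges = np.histogram(ages, bins=10, range=(0, 100))

# Print one line per interval: lower edge, upper edge, number of people.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.0f} to {right:5.0f}: {count}")
```

Each interval includes its lower edge and excludes its upper edge, except the last one, which includes both.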

[pic]

6. Scatter plot

import matplotlib.pyplot as plt

  

# x-axis values

x = [1,2,3,4,5,6,7,8,9,10]

# y-axis values

y = [2,4,5,7,6,8,9,11,12,12]

  

# plotting points as a scatter plot

plt.scatter(x, y, label= "stars", color= "green", 

            marker= "*", s=30)

  

# x-axis label

plt.xlabel('x - axis')

# y-axis label

plt.ylabel('y - axis')

# plot title

plt.title('My scatter plot!')

# showing legend

plt.legend()

  

# function to show the plot

plt.show()

• Here, we use plt.scatter() function to plot a scatter plot.

• Like a line, we define x and corresponding y – axis values here as well.

• marker argument is used to set the character to use as marker. Its size can be defined using s parameter.

[pic]

7. Pie Chart

import matplotlib.pyplot as plt

# defining labels

activities = ['eat', 'sleep', 'work', 'play']

# portion covered by each label

slices = [3, 7, 8, 6]

# color for each label

colors = ['r', 'y', 'g', 'b']

# plotting the pie chart

plt.pie(slices, labels = activities, colors=colors,
        startangle=90, shadow = True, explode = (0, 0, 0.1, 0),
        radius = 1.2, autopct = '%1.1f%%')

# plotting legend

plt.legend()

# showing the plot

plt.show()

• Here, we plot a pie chart by using plt.pie() method.

• First of all, we define the labels using a list called activities.

• Then, portion of each label can be defined using another list called slices.

• Color for each label is defined using a list called colors.

• shadow = True draws a shadow beneath the pie.

• startangle rotates the start of the pie chart by the given number of degrees counterclockwise from the x-axis.

• explode sets the fraction of the radius by which each wedge is offset.

• autopct formats the value shown on each wedge. Here, it shows the percentage to 1 decimal place.
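The percentages that autopct displays can be reproduced by hand. The short sketch below applies the same format string, '%1.1f%%', to each slice's share of the total; it only illustrates the arithmetic and does not call matplotlib.

```python
activities = ['eat', 'sleep', 'work', 'play']
slices = [3, 7, 8, 6]

# Each wedge shows its share of the total, formatted to one decimal place.
total = sum(slices)
labels = ['%1.1f%%' % (100 * s / total) for s in slices]

for activity, label in zip(activities, labels):
    print(activity, label)
```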

[pic]

8. Plotting curves of given equation

# importing the required modules

import matplotlib.pyplot as plt

import numpy as np

  

# setting the x - coordinates

x = np.arange(0, 2*(np.pi), 0.1)

# setting the corresponding y - coordinates

y = np.sin(x)

  

# plotting the points

plt.plot(x, y)

  

# function to show the plot

plt.show()

Here, we use NumPy, which is a general-purpose array-processing package in Python.

• To set the x – axis values, we use np.arange() method in which first two arguments are for range and third one for step-wise increment. The result is a numpy array.

• To get corresponding y-axis values, we simply use predefined np.sin() method on the numpy array.

• Finally, we plot the points by passing x and y arrays to the plt.plot() function.
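One detail worth checking is that np.arange() stops before the end value, so the curve above does not quite reach 2π. The sketch below inspects the array and compares it with np.linspace(), an alternative (not used in the listing above) that takes a number of points and includes the end value.

```python
import numpy as np

# Step-based sampling: the stop value 2*pi itself is excluded.
x = np.arange(0, 2 * np.pi, 0.1)
y = np.sin(x)

# Count-based sampling: 100 evenly spaced points, end value included.
x_lin = np.linspace(0, 2 * np.pi, 100)

print(len(x), x[-1])          # number of points and the last x value
print(len(x_lin), x_lin[-1])
```

Either array can be passed to plt.plot() in the same way; a smaller step or a larger number of points gives a smoother curve.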

[pic]

Mini Project-1

Develop Python code that loads any data set (for example, game_medal.csv) and plots a graph.

The data used was provided by The Guardian at Kaggle: Olympic Sports and Medals, 1896-2014. The first step will be to see the form of the data and manipulate it into a suitable format: rows as countries, columns as olympic games, values as medal counts.

Download Link-

Description of Data Sets

Which Olympic athletes have the most gold medals? Which countries are they from and how has it changed over time?

More than 35,000 medals have been awarded at the Olympics since 1896. The first two Olympiads awarded silver medals and an olive wreath for the winner, and the IOC retrospectively awarded gold, silver, and bronze to athletes based on their rankings. This dataset includes a row for every Olympic athlete that has won a medal since the first games.

Data was provided by the IOC Research and Reference Service and published by The Guardian's Datablog.

The Olympic Games.zip folder contains 3 different datasets:

• Dictionary.csv

• Summer.csv

• Winter.csv

Following Figure shows description of dictionary.csv

[pic]

Following Figure shows description of summer.csv

[pic]

Following Figure shows description of winter.csv

[pic]

Here we work on the summer.csv dataset.

Dataset Download Link-

[pic]

Complete Code-

Stream graph: stream graphs offer a visually appealing and story-rich method for presenting frequency data in multiple categories across a time-like dimension.

[pic]

Here we see that each entry is an Athlete representing a Country, of a given Gender, who won a Medal in some Event in the Olympics in City in a particular Year. For team-based sports, multiple individuals can receive medals, but we'll want to count these medals only once.

Then using a groupby on Country and Year, if we count the Medals and unstack the result, we end up with a dataframe in the desired format.
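The two steps just described can be sketched as follows. The rows are made up for illustration and only use column names mentioned above; the de-duplication key (Year, Event, Gender, Country, Medal) is an assumption about how a team medal can be counted once, and it would also merge two identical medals won by one country in the same individual event.

```python
import pandas as pd

# Hypothetical rows in the shape of the medal data (not the real file).
medals = pd.DataFrame({
    "Year":    [1996, 1996, 1996, 2000, 2000, 2000],
    "Event":   ["Basketball", "Basketball", "100m", "100m", "Basketball", "Basketball"],
    "Gender":  ["Men"] * 6,
    "Athlete": ["A", "B", "C", "D", "E", "F"],
    "Country": ["USA", "USA", "USA", "GBR", "FRA", "FRA"],
    "Medal":   ["Gold", "Gold", "Silver", "Gold", "Silver", "Silver"],
})

# Count a team medal only once: keep one row per event, country, and medal.
unique_medals = medals.drop_duplicates(
    subset=["Year", "Event", "Gender", "Country", "Medal"])

# Rows as countries, columns as Olympic years, values as medal counts.
summer = (unique_medals.groupby(["Country", "Year"])["Medal"]
          .count()
          .unstack(fill_value=0))

print(summer)
```

With the real file, medals would come from pd.read_csv() on summer.csv instead of being typed in.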

[pic]

Now, the NYT graph only names eight countries (the rest are grouped by continent). So we'll want to identify what these countries are in the list, based on their IOC country codes. There is some interesting trivia in which countries, regions, and groups are included, excluded, merged, or divided over time. At this point we can ignore the rest of the data and just focus on these categories.

countries = [
    "USA", # United States of America
    "CHN", # China
    "RU1", "URS", "EUN", "RUS", # Russian Empire, USSR, Unified Team (post-Soviet collapse), Russia
    "GDR", "FRG", "EUA", "GER", # East Germany, West Germany, Unified Team of Germany, Germany
    "GBR", "AUS", "ANZ", # Great Britain, Australia, Australasia (includes New Zealand)
    "FRA", # France
    "ITA" # Italy
]

sm = summer.loc[countries]

sm.loc["Rest of world"] = summer.loc[summer.index.difference(countries)].sum()

sm = sm[::-1]
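To see what these three lines do, here is a minimal sketch on a tiny made-up table. The medal counts and the extra codes KEN and BRA are invented for the example, and .copy() is added so that the new row is written to an independent table.

```python
import pandas as pd

# Hypothetical medal table: rows are IOC codes, columns are years.
summer = pd.DataFrame(
    {1996: [101, 50, 15, 7], 2000: [97, 59, 20, 9]},
    index=["USA", "CHN", "KEN", "BRA"],
)
countries = ["USA", "CHN"]

# Keep only the named countries.
sm = summer.loc[countries].copy()

# Sum every other country into a single "Rest of world" row.
rest = summer.index.difference(countries)
sm.loc["Rest of world"] = summer.loc[rest].sum()

# Reverse the row order so that "Rest of world" comes first.
sm = sm[::-1]

print(sm)
```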

Before any plotting, let's define colours similar to those in the NYT graph. For simplicity, I'll be using the named colours in matplotlib.

country_colors = {
    "USA":"steelblue",
    "CHN":"sandybrown",
    "RU1":"lightcoral", "URS":"indianred", "EUN":"indianred", "RUS":"lightcoral",
    "GDR":"yellowgreen", "FRG":"y", "EUA":"y", "GER":"y",
    "GBR":"silver",
    "AUS":"darkorchid", "ANZ":"darkorchid",
    "FRA":"silver",
    "ITA":"silver",
    "Rest of world": "gainsboro"}

Let's present this data as a stacked bar plot. This will show: i) the total number of medals won (total height) and ii) compare the relative number of medals countries won in different years.

%matplotlib inline

import matplotlib.pyplot as plt

import seaborn as sns

import numpy as np

sns.set_style("ticks")

sns.set_context("notebook", font_scale=1.2)

colors = [country_colors[c] for c in sm.index]

plt.figure(figsize=(12,8))

sm.T.plot.bar(stacked=True, color=colors, ax=plt.gca())

# Reverse the order of labels, so they match the data

handles, labels = plt.gca().get_legend_handles_labels()

plt.legend(handles[::-1], labels[::-1])

# Set labels and remove superfluous plot elements

plt.ylabel("Number of medals")

plt.title("Stacked barchart of select countries' medals at the Summer Olympics")

sns.despine()

[pic]

This plot is quite different from the desired graph. In particular, the bars don't have any continuity (which we'll achieve by using the plot.area method of DataFrames). Secondly, we have no columns for the years when the World Wars prevented the Games (1916, 1940, and 1944).

sm[1916] = np.nan # WW1

sm[1940] = np.nan # WW2

sm[1944] = np.nan # WW2

sm = sm[sm.columns.sort_values()]

plt.figure(figsize=(12,8))

sm.T.plot.area(color=colors, ax=plt.gca(), alpha=0.5)

# Reverse the order of labels, so they match the data

handles, labels = plt.gca().get_legend_handles_labels()

plt.legend(handles[::-1], labels[::-1])

# Set labels and remove superfluous plot elements

plt.ylabel("Number of medals")

plt.title("Stacked areachart of select countries' medals at the Summer Olympics")

plt.xticks(sm.columns, rotation=90)

sns.despine()

[pic]

This is looking much better. There are two features we are missing: i) this plot has a baseline (i.e. the bottom of the chart) set at zero, whereas we want the baseline to wiggle about ii) the transitions between times are jagged.

To fix the baseline, instead of using pandas's plot.area method, we use the stackplot function from matplotlib. Here, we show what the different baselines look like.

for bl in ["zero", "sym", "wiggle", "weighted_wiggle"]:
    plt.figure(figsize=(6, 4))
    f = plt.stackplot(sm.columns, sm.fillna(0), colors=colors, baseline=bl, alpha=0.5, linewidth=1)
    # Make the edges slightly darker than the fill
    for i, a in enumerate(f):
        a.set_edgecolor(sns.dark_palette(colors[i])[-2])
    plt.title("Baseline: {}".format(bl))
    plt.axis('off')
    plt.show()

[pic][pic]

Conclusion/Analysis: Hence, we are able to draw various plots using the seaborn, matplotlib, and pandas packages on a suitable dataset.

Assignment Questions

1. What is pandas?

2. What is matplotlib?

3. What is Seaborn?

4. What is a DataFrame?

5. What is the syntax for reading a CSV file in Python?

6. What is NumPy?

7. How do you drop duplicate pairs?

Oral Questions

1. What do you mean by a histogram?

2. What do you mean by a scatter plot?

3. What do you mean by a pie chart?

4. What do you mean by a bar chart?

5. What do you mean by a heatmap?


-----------------------

| W (4) | C (4) | D (4) | V (4) | T (4) | Total Marks with Sign |

|       |       |       |       |       |                       |

-----------------------

SNJB’S K.B.J. COLLEGE OF ENGINEERING, CHANDWAD
