Q: Explain Distance-based algorithms



Q1: Explain distance-based algorithms in simple English. What are the two standard approaches used in distance-based algorithms? (10 points)

Distance-based algorithms are classification methods built on the idea that each item mapped to a class is more similar to the other items in that class than it is to the items in other classes. Therefore, similarity (or distance) measures may be used to identify the "alikeness" of different items in the database.

The two standard approaches are as follows:

1. Simple Approach (IR Approach): A representative vector (centroid or medoid) is determined for each class; each item is then placed in the class whose representative it is most similar (closest) to.

2. K Nearest Neighbors: K nearest neighbors (KNN) assumes that the entire training set includes not only the data in the set but also the desired classification for each item. In effect, the training database becomes the model. When a classification is to be made for a new item, its distance to each item in the training set must be determined. Only the K closest entries in the training set are considered further, and the new item is placed in the class that contains the most items from this set of K closest items. (A sketch of both approaches appears below.)
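Here is a minimal sketch of both approaches in Python; the toy data, the single numeric attribute, and the hard-coded centroids are illustrative assumptions, not part of any standard library.

```python
# Minimal sketch of the two distance-based approaches (illustrative only).
from collections import Counter

def centroid_classify(item, centroids):
    """Simple (IR) approach: assign the item to the class with the closest centroid."""
    return min(centroids, key=lambda cls: abs(item - centroids[cls]))

def knn_classify(item, training, k):
    """KNN: majority class among the k training items closest to `item`."""
    nearest = sorted(training, key=lambda rec: abs(item - rec[0]))[:k]
    return Counter(cls for _, cls in nearest).most_common(1)[0][0]

# Toy usage with one numeric attribute (height in metres).
training = [(1.6, "Short"), (1.7, "Short"), (1.8, "Medium"),
            (1.9, "Medium"), (2.0, "Tall"), (2.1, "Tall")]
centroids = {"Short": 1.65, "Medium": 1.85, "Tall": 2.05}
print(centroid_classify(1.85, centroids))   # Medium
print(knn_classify(1.85, training, k=3))    # Medium
```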

Q2: Define classification (decision) trees. What are the main issues in decision trees? What are the standard decision tree algorithms used in the industry? (10 points)

A decision tree is a tree where the root and each internal node are labeled with a question. The arcs emanating from each node represent each possible answer to the associated question. Each leaf node represents the prediction of a solution to the problem under consideration. The decision tree approach to classification is to divide the search space into rectangular regions.
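The rectangular-region intuition can be seen in a tiny hand-built tree; the thresholds below are illustrative choices for the height data in this document, not the output of any algorithm.

```python
# A hand-built decision tree: each internal node asks a question about an
# attribute, and each leaf predicts a class (thresholds chosen by hand).
def classify(height):
    if height <= 1.7:        # root question: first split of the search space
        return "Short"
    elif height <= 1.95:     # second question refines the remaining region
        return "Medium"
    else:
        return "Tall"

print(classify(1.85))   # Medium
```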

The main issues are as follows:

• Missing Data: Missing data values cause problems and must be handled; otherwise they may produce inaccurate results. The three approaches to handling missing data are:

1. Ignore the missing data.

2. Assume a value for the missing data.

3. Assume a special value for the missing data.

• Measuring performance: Which classification is best depends on the users' interpretation of the problem. The performance of classification algorithms is usually examined by evaluating the accuracy of the classification; an OC curve and a confusion matrix can be used to examine that accuracy.

• Outliers: Outliers affect the division of the nodes. Three approaches to handle outliers are:

1. Omit if it is a mistake.

2. Compare the results obtained with and without the outliers.

3. Use the median or a trimmed mean.

• Training Data: The size of the training data needs to be determined carefully. If it is too small, the model may underfit the data; if it is too big, the model may overfit.

• Methodology: Different methodologies may produce different results. The most appropriate method needs to be selected carefully.

• Choosing splitting attributes: Which attributes are used as splitting attributes affects the performance of applying the built DT. Some attributes are better than others. The choice involves not only an examination of the data in the training set but also the informed input of domain experts.

• Ordering of splitting attributes: The order in which the attributes are chosen is important.

• Splits: With some attributes the domain is small, so the number of splits is obvious from the domain. However, if the domain is continuous or has a large number of values, the number of splits to use is not easily determined.

• Tree structure: To improve the performance of applying the tree for classification, a balanced tree with the fewest levels is desirable.

• Stopping Criteria: The creation of the tree definitely stops when the training data are perfectly classified. However, stopping earlier may be desirable to prevent the creation of large trees; this is a tradeoff between classification accuracy and performance. Stopping earlier may also be performed to prevent overfitting. Conversely, more levels than strictly needed may be created in the tree if it is known that there are data distributions not represented in the training data.

• Pruning: Once the tree is constructed, some modifications might be needed to improve its performance during the classification phase. The pruning phase might remove redundant comparisons or remove subtrees to achieve better performance.

The standard Decision tree algorithms used in the industry are as follows:

• ID3: This technique is based on information theory and attempts to minimize the expected number of comparisons. The basic strategy is to choose the splitting attribute with the highest information gain first. Entropy is used to quantify information (the formulas are given after this list).

• C4.5 and C5.0: C4.5 improves ID3 in many ways. C5.0 is the commercial version of C4.5, widely used in data mining packages such as Clementine and RuleQuest. It is targeted toward use on large datasets.

• CART (Classification and Regression Trees): This technique generates a binary decision tree. As with ID3, entropy is used to choose the best splitting attribute and criterion.
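For reference, the quantities these algorithms compute are standard: given a set D whose k classes occur with proportions p1, ..., pk, entropy(D) = p1 log(1/p1) + ... + pk log(1/pk), and the information gain of splitting D into subsets D1, ..., Dm is gain = entropy(D) - (|D1|/|D|) entropy(D1) - ... - (|Dm|/|D|) entropy(Dm). These are the formulas applied numerically in Q4 and Q5 below (with base-10 logarithms).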

Table: Training data (referenced in Q3-Q5)

| # | Name | Gender | Height (m) | Output1 | Output2 |
|---|------|--------|------------|---------|---------|
| 1 | Kris | F | 1.6 | Short | Medium |
| 2 | Jim | M | 2 | Tall | Medium |
| 3 | Maggie | F | 1.9 | Medium | Tall |
| 4 | Martha | F | 1.88 | Medium | Tall |
| 5 | Stephanie | M | 1.7 | Short | Medium |
| 6 | Bob | M | 1.85 | Medium | Medium |
| 7 | Kat | F | 1.6 | Short | Medium |
| 8 | Dave | M | 1.7 | Short | Medium |
| 9 | Worth | M | 2.2 | Tall | Tall |
| 10 | Steve | M | 2.1 | Tall | Tall |
| 11 | Debbie | F | 1.8 | Medium | Medium |
| 12 | Todd | M | 1.95 | Medium | Medium |
| 13 | Kim | F | 1.9 | Medium | Tall |
| 14 | Amy | F | 1.8 | Medium | Medium |
| 15 | Wynette | F | 1.75 | Medium | Medium |
| 16 | John | M | 2.1 | Tall | Medium |
| 17 | Anna | F | 2 | Tall | Tall |
| 18 | Justin | M | 1.65 | Short | Short |
| 19 | Marvin | M | 1.7 | Short | Medium |
| 20 | Mary | F | 1.9 | Tall | Medium |

Q3: Use KNN to classify <James, M, 1.85> with K = 5 using the height data, assuming that Output1 is correct. (15 points)

Data sorted by height:

| Name | Gender | Height (m) | Output1 | Output2 |
|------|--------|------------|---------|---------|
| Kat | F | 1.6 | Short | Medium |
| Kris | F | 1.6 | Short | Medium |
| Justin | M | 1.65 | Short | Short |
| Dave | M | 1.7 | Short | Medium |
| Marvin | M | 1.7 | Short | Medium |
| Stephanie | M | 1.7 | Short | Medium |
| Wynette | F | 1.75 | Medium | Medium |
| Amy | F | 1.8 | Medium | Medium |
| Debbie | F | 1.8 | Medium | Medium |
| Bob | M | 1.85 | Medium | Medium |
| Martha | F | 1.88 | Medium | Tall |
| Kim | F | 1.9 | Medium | Tall |
| Maggie | F | 1.9 | Medium | Tall |
| Mary | F | 1.9 | Tall | Medium |
| Todd | M | 1.95 | Medium | Medium |
| Anna | F | 2 | Tall | Tall |
| Jim | M | 2 | Tall | Medium |
| John | M | 2.1 | Tall | Medium |
| Steve | M | 2.1 | Tall | Tall |
| Worth | M | 2.2 | Tall | Tall |

Measuring distance on height alone, the candidate nearest neighbours of James (height 1.85) and their distances are:

{Bob (0), Martha (0.03), Amy (0.05), Debbie (0.05), Kim (0.05), Maggie (0.05), Mary (0.05)}

With K = 5, Bob and Martha are certainly selected; the remaining three come from the five items tied at distance 0.05. Of those five, all but Mary have Output1 = Medium, so however the tie is broken, at least four of the five nearest neighbours are Medium. Thus KNN classifies James as Medium. In real life a tie-breaker rule is used to select the best 5 out of the tied candidates.
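The result can be checked mechanically. The sketch below (the variable names and the height-only distance are assumptions consistent with the working above) reproduces the classification:

```python
# Check the KNN classification of <James, M, 1.85> using height distance only.
from collections import Counter

data = [("Kris", 1.6, "Short"), ("Jim", 2.0, "Tall"), ("Maggie", 1.9, "Medium"),
        ("Martha", 1.88, "Medium"), ("Stephanie", 1.7, "Short"),
        ("Bob", 1.85, "Medium"), ("Kat", 1.6, "Short"), ("Dave", 1.7, "Short"),
        ("Worth", 2.2, "Tall"), ("Steve", 2.1, "Tall"), ("Debbie", 1.8, "Medium"),
        ("Todd", 1.95, "Medium"), ("Kim", 1.9, "Medium"), ("Amy", 1.8, "Medium"),
        ("Wynette", 1.75, "Medium"), ("John", 2.1, "Tall"), ("Anna", 2.0, "Tall"),
        ("Justin", 1.65, "Short"), ("Marvin", 1.7, "Short"), ("Mary", 1.9, "Tall")]

k, james = 5, 1.85
nearest = sorted(data, key=lambda r: abs(r[1] - james))[:k]
print([name for name, _, _ in nearest])   # Bob, Martha, plus three of the ties
print(Counter(cls for _, _, cls in nearest).most_common(1)[0][0])   # Medium
```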

Q4: Calculate the entropy of the training database. (Refer to the table, using Output1.) (10 points)

The data sorted by Output1 and then by gender:

| Name | Gender | Height (m) | Output1 | Output2 |
|------|--------|------------|---------|---------|
| Wynette | F | 1.75 | Medium | Medium |
| Amy | F | 1.8 | Medium | Medium |
| Debbie | F | 1.8 | Medium | Medium |
| Martha | F | 1.88 | Medium | Tall |
| Kim | F | 1.9 | Medium | Tall |
| Maggie | F | 1.9 | Medium | Tall |
| Bob | M | 1.85 | Medium | Medium |
| Todd | M | 1.95 | Medium | Medium |
| Kat | F | 1.6 | Short | Medium |
| Kris | F | 1.6 | Short | Medium |
| Justin | M | 1.65 | Short | Short |
| Dave | M | 1.7 | Short | Medium |
| Marvin | M | 1.7 | Short | Medium |
| Stephanie | M | 1.7 | Short | Medium |
| Mary | F | 1.9 | Tall | Medium |
| Anna | F | 2 | Tall | Tall |
| Jim | M | 2 | Tall | Medium |
| John | M | 2.1 | Tall | Medium |
| Steve | M | 2.1 | Tall | Tall |
| Worth | M | 2.2 | Tall | Tall |

The class counts under Output1 are 6 Short, 8 Medium, and 6 Tall, so the entropy of the starting set = (6/20) log(20/6) + (8/20) log(20/8) + (6/20) log(20/6) = 0.472903 (all logarithms here are base 10).
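This value is easy to verify; a minimal sketch (the helper name is my own):

```python
# Verify the starting-set entropy for class counts 6 Short, 8 Medium, 6 Tall.
from math import log10

def entropy(counts):
    total = sum(counts)
    return sum((c / total) * log10(total / c) for c in counts if c > 0)

print(entropy([6, 8, 6]))   # ~0.472903
```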

Q5: a. Find the gain in entropy by using the Gender attribute. (10 points)

The entropy of the subset with Gender = F (2 Short, 6 Medium, 2 Tall) = (2/10) log(10/2) + (6/10) log(10/6) + (2/10) log(10/2) = 0.412697

The entropy of the subset with Gender = M (4 Short, 2 Medium, 4 Tall) = (4/10) log(10/4) + (2/10) log(10/2) + (4/10) log(10/4) = 0.458146

The weighted sum of the two entropies = (10/20)(0.412697) + (10/20)(0.458146) = 0.435422

The gain in entropy by using the gender attribute = 0.472903 - 0.435422 = 0.037482

b. Find the gain in entropy by using the Height attribute. (15 points)

| # | Name | Gender | Height (m) | Output1 | Output2 |
|---|------|--------|------------|---------|---------|
| 1 | Kat | F | 1.6 | Short | Medium |
| 2 | Kris | F | 1.6 | Short | Medium |
| 3 | Justin | M | 1.65 | Short | Short |
| 4 | Dave | M | 1.7 | Short | Medium |
| 5 | Marvin | M | 1.7 | Short | Medium |
| 6 | Stephanie | M | 1.7 | Short | Medium |
| 7 | Wynette | F | 1.75 | Medium | Medium |
| 8 | Amy | F | 1.8 | Medium | Medium |
| 9 | Debbie | F | 1.8 | Medium | Medium |
| 10 | Bob | M | 1.85 | Medium | Medium |
| 11 | Martha | F | 1.88 | Medium | Tall |
| 12 | Kim | F | 1.9 | Medium | Tall |
| 13 | Maggie | F | 1.9 | Medium | Tall |
| 14 | Mary | F | 1.9 | Tall | Medium |
| 15 | Todd | M | 1.95 | Medium | Medium |
| 16 | Anna | F | 2 | Tall | Tall |
| 17 | Jim | M | 2 | Tall | Medium |
| 18 | John | M | 2.1 | Tall | Medium |
| 19 | Steve | M | 2.1 | Tall | Tall |
| 20 | Worth | M | 2.2 | Tall | Tall |

Dividing the height domain into right-closed intervals gives the following class counts (Short, Medium, Tall) under Output1 and the corresponding entropies:

| Division | Short | Medium | Tall | Entropy |
|---|---|---|---|---|
| (0, 1.6] | 2 | 0 | 0 | (2/2) log(2/2) = 0 |
| (1.6, 1.7] | 4 | 0 | 0 | (4/4) log(4/4) = 0 |
| (1.7, 1.8] | 0 | 3 | 0 | (3/3) log(3/3) = 0 |
| (1.8, 1.9] | 0 | 4 | 1 | (4/5) log(5/4) + (1/5) log(5/1) = 0.217322 |
| (1.9, 2.0] | 0 | 1 | 2 | (1/3) log(3/1) + (2/3) log(3/2) = 0.276435 |
| (2.0, ∞) | 0 | 0 | 3 | (3/3) log(3/3) = 0 |

(Note that Mary, at height 1.9 with Output1 = Tall, falls in (1.8, 1.9], so that division is not pure.)

The weighted sum of the six entropies = (5/20)(0.217322) + (3/20)(0.276435) = 0.054331 + 0.041465 = 0.095796

The gain in entropy by using the height attribute = 0.472903 - 0.095796 = 0.377107

Since the height attribute has the greater gain (0.377107 versus 0.037482), height is chosen over gender as the first splitting attribute.
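Both gains can be reproduced with the same entropy helper as in Q4 (a sketch; the per-subset counts are read directly from the tables above):

```python
# Reproduce the information gains for gender and height (log base 10).
from math import log10

def entropy(counts):
    total = sum(counts)
    return sum((c / total) * log10(total / c) for c in counts if c > 0)

def gain(base_counts, subsets):
    total = sum(base_counts)
    weighted = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(base_counts) - weighted

base = [6, 8, 6]                                   # Short, Medium, Tall
gender = [[2, 6, 2], [4, 2, 4]]                    # F, M
height = [[2, 0, 0], [4, 0, 0], [0, 3, 0],
          [0, 4, 1], [0, 1, 2], [0, 0, 3]]         # the six height divisions
print(gain(base, gender))   # ~0.037482
print(gain(base, height))   # ~0.377107
```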

Q6: Define neural networks. What are the main issues in using neural networks in classification problems? (10 points)

Neural networks are signal processing systems that try to emulate the behavior of biological nervous systems by providing a mathematical model of combinations of numerous neurons connected in a network. A neural network is a massively parallel interconnection of simple processing elements that interacts with objects of the real world in a manner similar to biological systems.

The main issues in using neural networks in classification problems are:

• Attributes (number of source nodes): Which attributes to use as inputs must be determined; this is the same issue as choosing splitting attributes for a DT.

• Number of hidden layers: In the simplest case there is only one hidden layer.

• Number of hidden nodes: Choosing the best number of hidden nodes per hidden layer is one of the most difficult problems when using NN.

• Training Data: As with DTs, with too much training data the NN may suffer from overfitting, while with too little it may not be able to classify accurately.

• Number of Sinks: Although it is normally assumed that the number of output nodes is the same as the number of classes, this is not always the case. For example, with two classes there could be only one output node, with the resulting value being the probability of being in the associated class. Subtracting this value from one gives the probability of being in the second class.

• Interconnections: In the simplest case, each node is connected to all the nodes in the next level.

• Weights: The weight assigned to an arc indicates the relative strength between the two nodes it connects. Initial weights are small positive values assigned randomly (see the sketch after this list).

• Activation Functions: Many different types of activation functions can be used.

• Learning technique: The technique for adjusting the weights is called the learning technique. Although many approaches may be used, the most common is some form of backpropagation.

• Stop: The learning technique may stop when all the training tuples have propagated through the network, or it may be based on time or error rate.
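To make these pieces concrete, here is a minimal single-hidden-layer feedforward pass in Python. The layer sizes, the sigmoid activation, and the small random positive weights are the simple choices described above, not a prescribed design.

```python
# Minimal feedforward pass: 2 inputs -> 3 hidden nodes -> 1 output node.
import math
import random

random.seed(0)
n_in, n_hidden, n_out = 2, 3, 1

# Small positive initial weights, assigned randomly (as described above).
w_ih = [[random.uniform(0.01, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
w_ho = [[random.uniform(0.01, 0.1) for _ in range(n_hidden)] for _ in range(n_out)]

def sigmoid(x):
    """A common activation function; many others could be used."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs):
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in w_ih]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_ho]

# One output node for a two-class problem: the value is P(class 1).
print(forward([1.85, 1.0]))
```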

Q7: What are the advantages and disadvantages of neural networks? (10 points)

Advantages:

• NNs are more robust than DTs because of the weights.

• NNs improve their performance by learning. This may continue even after the training set has been applied.

• The use of NNs can be parallelized for better performance.

• There is a low error rate, and thus a high degree of accuracy, once the appropriate training has been performed.

• NNs are more robust than DTs in noisy environments.

Disadvantages:

• NNs are difficult to understand. Non-technical users may have difficulty understanding how NNs work, whereas decision trees are easy to explain.

• Generating rules from NNs is not straightforward.

• Input attributes must be numeric; non-numeric attributes are difficult to handle.

• Testing and verification are difficult.

• As with DTs, overfitting may result.

• The learning phase may fail to converge.

• NNs may be quite expensive to use.

Q8: What are the steps involved in using neural networks for classification problems? (10 points)

1. Determine the number of output nodes as well as what attributes should be used as input. The number of hidden layers also must be decided. This step should be performed by domain experts.

2. Determine the weights and the activation functions to be used for the graph.

3. For each tuple in the training set, propagate it through the network and compare the output prediction to the actual result. If the prediction is accurate, adjust the weights to ensure that this prediction has a higher output value the next time. If the prediction is not correct, adjust the weights to provide a lower output value for this class. (A toy sketch of this loop follows the list.)

4. For each tuple, propagate through the network and make the appropriate classification.
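The cycle of steps 3 and 4 can be sketched with a toy single-node example; the simple error-driven weight update below stands in for full backpropagation, and the data and learning rate are illustrative assumptions.

```python
# Toy illustration of steps 3-4: propagate each tuple, compare the prediction
# to the actual class, and adjust the weights to reduce the error.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

weights, bias, rate = [0.05, 0.05], 0.0, 0.5
# (height, gender flag) -> class: 0 = not Tall, 1 = Tall (toy data).
training = [([1.6, 0.0], 0), ([1.7, 0.0], 0), ([2.0, 1.0], 1), ([2.1, 1.0], 1)]

for _ in range(1000):                        # step 3: training passes
    for inputs, target in training:
        out = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        err = target - out                   # raise/lower the output value
        weights = [w + rate * err * x for w, x in zip(weights, inputs)]
        bias += rate * err

for inputs, _ in training:                   # step 4: classify each tuple
    out = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
    print(inputs, round(out))
```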
