Cluster Analysis Example: SAS program and output interleaved with comments

Title Cluster Analysis for Hypothetical Data;
data t;
input cid $ 1-2 income educ;
cards;
c1 5 5
c2 6 6
c3 15 14
c4 16 15
c5 25 20
c6 30 19
run;

proc cluster simple noeigen method=centroid rmsstd rsquare nonorm out=tree;
id cid;
var income educ;
run;

The SIMPLE option displays simple descriptive statistics.

The NOEIGEN option suppresses computation of eigenvalues. Specifying the NOEIGEN option saves time if the number of variables is large, but it should be used only if the variables are nearly uncorrelated or if you are not interested in the cubic clustering criterion.

The METHOD= specification determines the clustering method used by the procedure. Here, we are using the CENTROID method.

The RMSSTD option displays the root-mean-square standard deviation of each cluster.

The RSQUARE option displays the R-squared and semipartial R-squared statistics, which are used to evaluate the cluster solution.

The NONORM option prevents the distances from being normalized to unit mean or unit root mean square with most methods.

The values of the ID variable identify observations in the displayed cluster history and in the OUTTREE= data set. If the ID statement is omitted, each observation is denoted by OBn, where n is the observation number.

The VAR statement lists the numeric variables to be used in the cluster analysis. If you omit the VAR statement, all numeric variables not listed in other statements are used.

proc tree data=tree out=clus3 nclusters=3;
id cid;
copy income educ;

The TREE procedure produces a tree diagram, also known as a dendrogram or phenogram, using a data set created by the CLUSTER procedure. The CLUSTER procedure creates output data sets that contain the results of hierarchical clustering as a tree structure.
The TREE procedure uses the output data set to produce a diagram of the tree structure.

The NCLUSTERS= option specifies the number of clusters desired in the OUT= data set.

The ID variable is used to identify the objects (leaves) in the tree on the output. The ID variable can be a character or numeric variable of any length.

The COPY statement specifies one or more character or numeric variables to be copied to the OUT= data set.

proc sort;
by cluster;
proc print;
by cluster;
var cid income educ;
title2 '3-cluster solution';
run;

The above commands yield the following SAS output:

Cluster Analysis for Hypothetical Data                                   1

The CLUSTER Procedure
Centroid Hierarchical Cluster Analysis

Variable       Mean    Std Dev   Skewness   Kurtosis   Bimodality
income      16.1667     9.9883     0.2684    -1.4015       0.2211
educ        13.1667     6.3692    -0.4510    -1.8108       0.2711

Root-Mean-Square Total-Sample Standard Deviation = 8.376555

Cluster History

                                   RMS                       Centroid
NCL   --Clusters Joined--   FREQ   STD      SPRSQ    RSQ     Distance
  5   c1        c2             2   0.7071   0.0014   .999      1.4142
  4   c3        c4             2   0.7071   0.0014   .997      1.4142
  3   c5        c6             2   2.5495   0.0185   .979       5.099
  2   CL4       CL3            4   5.5227   0.2409   .738          13
  1   CL5       CL2            6   8.3766   0.7378   .000      19.704

The statistics above provide information about the cluster solution.

RMSSTD is the pooled standard deviation of all the variables forming the cluster. Since the objective of cluster analysis is to form homogeneous groups, the RMSSTD of a cluster should be as small as possible.

SPRSQ (semipartial R-squared) measures the loss of homogeneity that results from combining two groups or clusters to form a new group or cluster. Thus, the SPRSQ value should be small, which implies that the two groups being merged are homogeneous.

RSQ (R-squared) measures the extent to which groups or clusters are different from each other (so, when there is just one cluster, the RSQ value is, intuitively, zero). Thus, the RSQ value should be high.
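The RMSSTD, SPRSQ, and RSQ columns of the Cluster History can be checked by hand from the six observations. The following is a minimal sketch in Python, not part of the SAS program. It assumes that SPRSQ is the increase in pooled within-cluster sum of squares caused by a merge divided by the total sum of squares, that RSQ is one minus the pooled within-cluster sum of squares divided by the total, and that RMSSTD pools the variances of both variables within the new cluster. The helper names `sum_sq` and `rmsstd` are hypothetical.

```python
import math

# Hypothetical data from the example: (income, educ) for each observation.
data = {"c1": (5, 5), "c2": (6, 6), "c3": (15, 14),
        "c4": (16, 15), "c5": (25, 20), "c6": (30, 19)}

def sum_sq(ids):
    """Sum of squared deviations of the listed observations from their centroid."""
    pts = [data[i] for i in ids]
    total = 0.0
    for j in range(2):  # j = 0 is income, j = 1 is educ
        mean_j = sum(p[j] for p in pts) / len(pts)
        total += sum((p[j] - mean_j) ** 2 for p in pts)
    return total

def rmsstd(ids):
    """Pooled standard deviation over both variables for one cluster."""
    return math.sqrt(sum_sq(ids) / (2 * (len(ids) - 1)))

# Merge order read off the Cluster History (NCL = 5, 4, 3, 2, 1).
merges = [
    (["c1"], ["c2"]),
    (["c3"], ["c4"]),
    (["c5"], ["c6"]),
    (["c3", "c4"], ["c5", "c6"]),              # CL4 + CL3
    (["c1", "c2"], ["c3", "c4", "c5", "c6"]),  # CL5 + CL2
]

total_ss = sum_sq(list(data))
within_ss = 0.0
history = []  # one (RMSSTD, SPRSQ, RSQ) tuple per merge
for left, right in merges:
    merged = left + right
    increase = sum_sq(merged) - sum_sq(left) - sum_sq(right)
    within_ss += increase
    history.append((rmsstd(merged), increase / total_ss, 1 - within_ss / total_ss))

for ncl, (rms, sprsq, rsq) in zip(range(5, 0, -1), history):
    print(f"NCL={ncl}  RMSSTD={rms:.4f}  SPRSQ={sprsq:.4f}  RSQ={rsq:.3f}")
```

Under these assumptions the printed values agree with the RMS STD, SPRSQ, and RSQ columns of the listing, which makes it easier to see that the large jump in SPRSQ at NCL=2 is what argues for stopping at three clusters.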
Centroid Distance is simply the Euclidean distance between the centroids of the two clusters that are to be joined or merged. So, Centroid Distance is a measure of the homogeneity of merged clusters, and the value should be small.

Cluster Analysis for Hypothetical Data                                   2
3-cluster solution

CLUSTER=1

Obs    cid    income    educ
  1    c1          5       5
  2    c2          6       6

CLUSTER=2

Obs    cid    income    educ
  3    c3         15      14
  4    c4         16      15

CLUSTER=3

Obs    cid    income    educ
  5    c5         25      20
  6    c6         30      19

Title Non-Hierarchical Cluster Analysis of Hypothetical Data;
data t2;
input cid $ 1-2 income educ;
cards;
c1 5 5
c2 6 6
c3 15 14
c4 16 15
c5 25 20
c6 30 19
run;

proc fastclus radius=0 replace=full maxclusters=3 maxiter=20 list distance;
id cid;
var income educ;
run;

You must specify either the MAXCLUSTERS= or the RADIUS= argument in the PROC FASTCLUS statement.

The RADIUS= option establishes the minimum distance criterion for selecting new seeds. No observation is considered as a new seed unless its minimum distance to previous seeds exceeds the value given by the RADIUS= option. The default value is 0.

The MAXCLUSTERS= option specifies the maximum number of clusters allowed. If you omit the MAXCLUSTERS= option, a value of 100 is assumed.

The REPLACE= option specifies how seed replacement is performed. FULL requests default seed replacement. PART requests seed replacement only when the distance between the observation and the closest seed is greater than the minimum distance between seeds. NONE suppresses seed replacement. RANDOM selects a simple pseudo-random sample of complete observations as initial cluster seeds.

The MAXITER= option specifies the maximum number of iterations for recomputing cluster seeds. When the value of the MAXITER= option is greater than 0, each observation is assigned to the nearest seed, and the seeds are recomputed as the means of the clusters.
The LIST option lists all observations, giving the value of the ID variable (if any), the number of the cluster to which the observation is assigned, and the distance between the observation and the final cluster seed.

The DISTANCE option computes distances between the cluster means.

The ID variable, which can be character or numeric, identifies observations on the output when you specify the LIST option.

The VAR statement lists the numeric variables to be used in the cluster analysis. If you omit the VAR statement, all numeric variables not listed in other statements are used.

Non-Hierarchical Cluster Analysis of Hypothetical Data                   1

The FASTCLUS Procedure
Replace=FULL  Radius=0  Maxclusters=3  Maxiter=20  Converge=0.02

Initial Seeds

Cluster         income            educ
   1        5.00000000      5.00000000
   2       30.00000000     19.00000000
   3       16.00000000     15.00000000

Minimum Distance Between Initial Seeds = 14.56022

Iteration History

                            Relative Change in Cluster Seeds
Iteration    Criterion          1          2          3
    1           1.5811     0.0486     0.1751     0.0486
    2           1.1180          0          0          0

Convergence criterion is satisfied.

Here, the cluster solution at the second iteration is the final cluster solution because the change in cluster seeds at the second iteration is less than the convergence criterion. Note that a zero change in the cluster seeds at the second iteration implies that the reallocation did not result in any reassignment of observations.
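The Iteration History above can be traced with a short sketch of the assign-and-recompute loop. This is an illustration in Python, not the FASTCLUS implementation. It assumes that the relative change of a seed is the distance the seed moved divided by the minimum distance between the initial seeds, and that iteration stops once the largest relative change is at most the Converge= value of 0.02. The function name `dist` and the variable names are hypothetical.

```python
import math

# Hypothetical data from the example: (income, educ) for each observation.
points = {"c1": (5, 5), "c2": (6, 6), "c3": (15, 14),
          "c4": (16, 15), "c5": (25, 20), "c6": (30, 19)}

# Initial seeds as reported in the "Initial Seeds" table.
seeds = [(5.0, 5.0), (30.0, 19.0), (16.0, 15.0)]
converge = 0.02   # Converge= value shown in the listing header
max_iter = 20     # Maxiter= value used in the PROC FASTCLUS statement

def dist(a, b):
    """Euclidean distance between two points in the (income, educ) plane."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

min_seed_dist = min(dist(seeds[i], seeds[j])
                    for i in range(len(seeds))
                    for j in range(i + 1, len(seeds)))

changes = []     # relative change of each seed, one list per iteration
assignment = {}  # observation id -> cluster number (1-based)
for _ in range(max_iter):
    # Step 1: assign every observation to its nearest seed.
    members = [[] for _ in seeds]
    for cid, p in points.items():
        nearest = min(range(len(seeds)), key=lambda k: dist(p, seeds[k]))
        members[nearest].append(p)
        assignment[cid] = nearest + 1
    # Step 2: recompute each seed as the mean of its cluster
    # (an empty cluster keeps its old seed).
    new_seeds = [
        (sum(p[0] for p in m) / len(m), sum(p[1] for p in m) / len(m)) if m else s
        for m, s in zip(members, seeds)
    ]
    rel = [dist(old, new) / min_seed_dist for old, new in zip(seeds, new_seeds)]
    changes.append(rel)
    seeds = new_seeds
    if max(rel) <= converge:
        break

print(round(min_seed_dist, 5))
for i, rel in enumerate(changes, start=1):
    print(i, [round(r, 4) for r in rel])
print(seeds)
```

With these inputs the loop runs twice: the first pass moves the seeds to the cluster means, and the second pass changes nothing, which is the zero row in the Iteration History.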
Cluster Listing

                          Distance
Obs    cid    Cluster    from Seed
  1    c1        1          0.7071
  2    c2        1          0.7071
  3    c3        3          0.7071
  4    c4        3          0.7071
  5    c5        2          2.5495
  6    c6        2          2.5495

Criterion Based on Final Seeds = 1.1180

Cluster Summary

                                    Maximum Distance
                        RMS Std            from Seed    Radius      Nearest
Cluster    Frequency    Deviation     to Observation    Exceeded    Cluster
   1           2         0.7071             0.7071                     3
   2           2         2.5495             2.5495                     3
   3           2         0.7071             0.7071                     2

Non-Hierarchical Cluster Analysis of Hypothetical Data                   2

The FASTCLUS Procedure
Replace=FULL  Radius=0  Maxclusters=3  Maxiter=20  Converge=0.02

Cluster Summary

           Distance Between
Cluster    Cluster Centroids
   1             13.4536
   2             13.0000
   3             13.0000

The statistics used for the evaluation of the cluster solution are the same as in the hierarchical cluster analysis.

Statistics for Variables

Variable     Total STD    Within STD    R-Square    RSQ/(1-RSQ)
income         9.98833       2.12132    0.972937      35.950617
educ           6.36920       0.70711    0.992605     134.222222
OVER-ALL       8.37655       1.58114    0.978622      45.777778

The cluster solution can also be evaluated with respect to each clustering variable. If the measurement scales are not the same, then for each variable one should obtain the ratio of the respective Within STD to the Total STD, and compare this ratio across the variables.

Pseudo F Statistic = 68.67

Approximate Expected Over-All R-Squared = .

Cubic Clustering Criterion = .

WARNING: The two values above are invalid for correlated variables.

Cluster Means

Cluster         income            educ
   1        5.50000000      5.50000000
   2       27.50000000     19.50000000
   3       15.50000000     14.50000000

Cluster Standard Deviations

Cluster          income             educ
   1        0.707106781      0.707106781
   2        3.535533906      0.707106781
   3        0.707106781      0.707106781

Non-Hierarchical Cluster Analysis of Hypothetical Data                   3

The FASTCLUS Procedure
Replace=FULL  Radius=0  Maxclusters=3  Maxiter=20  Converge=0.02

Distance Between Cluster Centroids

Nearest Cluster              1              2              3
       1                     .    26.07680962    13.45362405
       2           26.07680962              .    13.00000000
       3           13.45362405    13.00000000              .
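The Statistics for Variables table and the Pseudo F Statistic can be recomputed from the final three-cluster assignment. The sketch below is a Python illustration, not SAS output. It assumes that Total STD uses n - 1 degrees of freedom, that Within STD uses n - k degrees of freedom (n observations, k clusters), and that the pseudo F statistic has the form [R-squared / (k - 1)] / [(1 - R-squared) / (n - k)]. The names `ss`, `stats`, and `std_ratio` are hypothetical.

```python
import math

# Final clusters from the Cluster Listing: cluster number -> (income, educ) points.
clusters = {
    1: [(5, 5), (6, 6)],
    2: [(25, 20), (30, 19)],
    3: [(15, 14), (16, 15)],
}
names = ("income", "educ")

all_points = [p for members in clusters.values() for p in members]
n, k = len(all_points), len(clusters)

def ss(pts, j):
    """Sum of squared deviations of variable j from its mean over pts."""
    mean_j = sum(p[j] for p in pts) / len(pts)
    return sum((p[j] - mean_j) ** 2 for p in pts)

stats = {}
total_all = within_all = 0.0
for j, name in enumerate(names):
    total = ss(all_points, j)
    within = sum(ss(members, j) for members in clusters.values())
    total_std = math.sqrt(total / (n - 1))
    within_std = math.sqrt(within / (n - k))
    rsq = 1 - within / total
    stats[name] = {
        "total_std": total_std,
        "within_std": within_std,
        "rsq": rsq,
        "rsq_ratio": rsq / (1 - rsq),
        # Ratio suggested in the text for comparing variables on different scales.
        "std_ratio": within_std / total_std,
    }
    total_all += total
    within_all += within

overall_rsq = 1 - within_all / total_all
pseudo_f = (overall_rsq / (k - 1)) / ((1 - overall_rsq) / (n - k))

for name, s in stats.items():
    print(name, {key: round(value, 6) for key, value in s.items()})
print("overall R-square:", round(overall_rsq, 6), "pseudo F:", round(pseudo_f, 2))
```

Under these assumptions the Within STD to Total STD ratio is smaller for educ than for income, so in this small example the clusters are relatively more homogeneous on educ.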