Washington State University



Featherman’s T-SQL Adventures – T-SQL Programming: WHILE Loops and Double Processing of Data

This module was created in response to Tableau's LOD calculations. Just fire up SQL Server Management Studio, use a digital copy of this document, and copy/paste/run the SQL code. Run each query and you will have an interactive learning (edutainment) experience. The SQL skills refined by completing this document will go a long way toward empowering the business analyst or DBA.

Sometimes, however, the data needed is not compiled and usable for analysis. Data munging (reformatting and compressing) is the skill of preparing datasets. The data munging solution shown here is a two-stage compression. Stage 1 uses a GROUP BY query to condense the dataset from ~10,000 records to ~800 records that reside in the working main memory (RAM) of the database engine. Different SQL tools are then used to perform a second condensing that reduces the data from ~800 records to fewer than 20. We demonstrate several methodologies for this second condensing, including WHILE loops and CASE statements. You can extrapolate the procedures used here to ten-step ETL processes, which are common.

The result is a dataset that has been compressed roughly 500:1 and serves as the data source for charts that examine a particularly difficult question: "What percentage of our sales come from customers that completed one transaction? What percentage come from customers that completed two transactions?" And so on. Please pause and think about how you would handle this analysis. DBAs and business analysts pre-process data before researching it, using procedures like those demonstrated here. The analyst seeking to extract compiled information from a database may consider using loops (and the related cursor functionality) to compile datasets. The demos here solve a tough problem: data often has to be condensed in several steps. For the second processing stage, several solutions are provided for your learning. WHILE loops are demonstrated to examine each record and perform further data transformation and compilation; CASE and a surprise solution are also shown. The problem-solving mindset of the DBA or analyst demonstrated here is to break big problems into little pieces, and execute.

Our research question is: "How many customers ordered only once from our company? How many customers ordered twice? What percentage of our customers have ordered fewer than 10 times?" This analysis is very useful for measuring the engagement level of different target markets, and the effectiveness of bundling promotions, new product acceptance, price reductions, etc.
We start by taking another look at creating variables, setting their values using SELECT queries, and displaying messages of program progress and execution using the PRINT function. OK, saddle up!!

USE [Featherman_Analytics];

DECLARE @NumberLineItems int
DECLARE @NumberOrders int
DECLARE @NumberCustomer int
DECLARE @AvgNumOrders numeric(5,2)

SET @NumberOrders = (SELECT COUNT(DISTINCT([Order ID]))
    FROM [dbo].[SuperStore_Orders])
SET @NumberLineItems = (SELECT COUNT([Order ID])
    FROM [dbo].[SuperStore_Orders])
SET @NumberCustomer = (SELECT COUNT(DISTINCT([Customer ID]))
    FROM [dbo].[SuperStore_Orders])

SET @AvgNumOrders = (@NumberOrders * 1.0) / @NumberCustomer

PRINT 'There are ' + CAST(@NumberOrders AS VARCHAR(6))
    + ' orders with a total of '
    + CAST(@NumberLineItems AS VARCHAR(6)) + ' line items'
PRINT 'There are ' + CAST(@NumberCustomer AS VARCHAR(6))
    + ' customers who average '
    + CAST(@AvgNumOrders AS VARCHAR(8)) + ' purchases each'

SELECT [Customer Name], COUNT(DISTINCT([Order ID])) as [# Orders]
FROM [dbo].[SuperStore_Orders]
GROUP BY [Customer Name]

This query produces two types of output: messages and data results (the resultset). Please run it now and compare your results. The messages are generated by the PRINT function. The catch with PRINT, however, is that it wants to print words only, so you have to use a CAST or CONVERT function to turn any calculated measure into a text value (here varchar, a variable number of characters up to a specified maximum). It is very common to create a variable and then run a query to assign a value to that variable. It is also common for DBAs to use PRINT statements to display the value of a variable as a type of error checking.

The final SELECT counts the number of orders each customer placed. Because the order number is repeated in the dataset (it is part of a composite primary key), we have to use the COUNT(DISTINCT([column name])) function. It is interesting to see SQL sort the results by the first letter of the first column, without even being asked. Details follow.

Lines 3-6 declare integer or numeric variables. All SQL variables start with the @ symbol (or @@ for system variables). It is a programming best practice to stack all variable declarations at the top of the program.

Lines 8-13 run SQL statements, each of which returns one integer value that the SET term assigns to a variable. The = sign here is called an assignment operator, and the action flows from right to left: SET @variable = instructs SSMS to assign a value to the variable. The SuperStore_Orders table has repeating values in the Order ID field, as there typically are multiple line items per order; therefore we must be careful to use SELECT COUNT(DISTINCT([Order ID])) to retrieve the correct number of orders. Said another way, we are querying a denormalized fact table that has repetitive values in the customer ID and customer name columns, so we need to be careful to count the DISTINCT values for customer ID.

Line 15 calculates an average and assigns it to a variable. Integer division in SQL truncates the answer to an integer, so we use the trick of multiplying the numerator by 1.0; it's an old workaround.

Lines 17-22 produce the messages shown in the SSMS query window. Recall that text in SSMS is shown in red and uses 'single quotes' (as opposed to SSIS and Excel, which use "double quotes"). The CAST function converts numeric variables to text; you have to specify the maximum number of characters the value can hold.
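To see the integer-division pitfall and the CAST-inside-PRINT pattern in isolation, here is a minimal sketch (toy values and hypothetical variable names, not part of the module's program):

DECLARE @a int = 7
DECLARE @b int = 2

-- integer / integer truncates: prints 3
PRINT @a / @b

-- multiply the numerator by 1.0 first: prints 3.500000
PRINT CAST((@a * 1.0) / @b AS varchar(10))

Run both PRINT statements together to see the difference the * 1.0 workaround makes.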
The + sign is used to concatenate (append) terms together; we concatenate words and numbers to form a coherent sentence. Each PRINT statement starts a new line. Lines 24-26 build a condensed table of results. This table will next be copied into a virtual table and further condensed, as the next iteration shows. The PRINT statements are removed to make the learning easier.

USE [Featherman_Analytics];

DECLARE @CustomerTotals TABLE ([Customer Name] nvarchar(50), [# Orders] int)

INSERT INTO @CustomerTotals ([Customer Name], [# Orders])
SELECT [Customer Name], COUNT(DISTINCT([Order ID]))
FROM [dbo].[SuperStore_Orders]
GROUP BY [Customer Name]

SELECT * FROM @CustomerTotals WHERE [# Orders] = 2

SELECT COUNT(*) FROM @CustomerTotals WHERE [# Orders] = 2

This code block shows how to insert the condensed data into an array table (also known as a table variable; temp tables are similar). Just to show that the task was accomplished, we display some results (easier than using PRINT). The INSERT INTO lines work together with the SELECT ... FROM ... GROUP BY statement to receive its data. The result is that ~800 rows of data are loaded into the working memory of the SQL Server database.

The two SELECT statements at the bottom produce the results. Go ahead and copy the SQL syntax into a new SSMS query and run it; you can change the number in the WHERE clause. The output lists the customers that had exactly two orders, and counts how many such customers there are in the dataset.

Please see prior modules on creating table arrays. Line 3 declares the table array, which is a variable, thus the @ sign in its name. This table array has two columns, a text column and an integer column. We refer to these by name, and these are the names that will bubble up into reports, so take some time when declaring them. Line 5 specifies that we will insert data into the table array, and which of its columns will receive the data. Lines 7-9 generate a condensed list of data: the name of each customer and how many purchases they made. To demonstrate that we did in fact load the table array with data, we use SELECT statements to display some of it; the two queries are parameterized to provide a short list of customers that made two purchases and a count of those customers.

USE [Featherman_Analytics];

DECLARE @CustomerTotals TABLE ([Customer Name] nvarchar(50), [# Orders] int, Ranking nvarchar(30))

INSERT INTO @CustomerTotals ([Customer Name], [# Orders], Ranking)
SELECT [Customer Name], COUNT(DISTINCT([Order ID])),
CASE NTILE(4) OVER (ORDER BY COUNT(DISTINCT([Order ID])) DESC)
    WHEN 1 THEN 'Top Customer'
    WHEN 2 THEN 'Good Customer'
    WHEN 3 THEN 'Emerging Customer'
    WHEN 4 THEN 'Bottom quartile Customer'
END
FROM [dbo].[SuperStore_Orders]
GROUP BY [Customer Name]
ORDER BY [Customer Name]

SELECT * FROM @CustomerTotals ORDER BY [# Orders] DESC

Before going forward with the analytics mentioned above to gauge customer engagement, let's take a look at ranking the customers so the marketing team can better communicate with them. This code reviews NTILE(), which adds a column of derived data containing (in this case) a 1, 2, 3, or 4, since we specified NTILE(4): it analyzes all the values in the column, calculates quartiles, and records which quartile each record falls in. Here we place the customers into one of four categories, ranked from 'Top Customer' to 'Bottom quartile Customer'.
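For reference, NTILE() on its own simply returns the bucket number. A minimal sketch, using the same table and column names as the module, with a hypothetical [Quartile] alias:

SELECT [Customer Name],
    COUNT(DISTINCT([Order ID])) AS [# Orders],
    NTILE(4) OVER (ORDER BY COUNT(DISTINCT([Order ID])) DESC) AS [Quartile]
FROM [dbo].[SuperStore_Orders]
GROUP BY [Customer Name]

Each customer lands in quartile 1 (most orders) through 4 (fewest orders); the CASE wrapper in the query above only translates that number into words.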
If we removed the CASE term from the NTILE line and made the new Ranking column of the array table an integer data type, we could simply store the quartile number in the column. To enhance the interface, however, we use a CASE statement that examines the quartile assigned to each record and converts it to a textual descriptor. Notice in the first highlighted text that the new Ranking column is an nvarchar text field, to hold the words. We will use CASE more in this document, and it will become understandable as you use it. Please copy and run this query, and change the terms in single quotes. Although CASE can be used in many ways, it is usually used as part of a SELECT statement to create a new column of descriptive information (useful for an Excel slicer). Here we combine the power of NTILE and CASE to provide useful functionality.

USE [Featherman_Analytics];

DECLARE @CustomerTotals TABLE ([Customer Name] nvarchar(50), [# Orders] int, Ranking nvarchar(30))

INSERT INTO @CustomerTotals ([Customer Name], [# Orders], Ranking)
SELECT [Customer Name], COUNT(DISTINCT([Order ID])),
CASE
    WHEN COUNT(DISTINCT([Order ID])) > 10 THEN 'Top Customer'
    WHEN COUNT(DISTINCT([Order ID])) BETWEEN 7 AND 10 THEN 'Good Customer'
    WHEN COUNT(DISTINCT([Order ID])) BETWEEN 4 AND 6 THEN 'Emerging Customer'
    WHEN COUNT(DISTINCT([Order ID])) BETWEEN 1 AND 3 THEN 'Bottom quartile Customer'
END
FROM [dbo].[SuperStore_Orders]
GROUP BY [Customer Name]
ORDER BY [Customer Name]

SELECT * FROM @CustomerTotals ORDER BY [# Orders] DESC

This demonstration shows an alternate way of performing the calculation. The code provides the same functionality and output as the prior example, but the methodology is different; there usually is more than one way to solve a data problem. In the prior demonstration we used the CASE statement to turn the 1, 2, 3, 4 into a textual descriptor. Later in this module we will use CASE statements again, so let's give a more straightforward example rather than conflate it with NTILE().

Of course, the beauty of NTILE() is that it figures out the range of the data and calculates the quartiles itself. If, however, you do know the numeric ranges (buckets, bins) that you want to split the data into, then you can use CASE processing quite easily, as shown here. CASE is like a series of IF statements: if this, then do that; otherwise do the other thing. Please run the query and perhaps change the values in the BETWEEN statements to alter the results. The BETWEEN statements provide the condition being tested, and the value written to the column depends on which condition the row satisfies. The beauty here is that because you are using CASE processing within a SELECT statement, the code evaluates every single row of data in the table. Amazing! Unlike other programming languages, where you specify the value being tested only once, in SQL you have to repeat it (here COUNT(DISTINCT([Order ID]))) in every WHEN clause. Notice the ranges should be mutually exclusive, and would need to be a bit different if we were not dealing with integer values. You can also test for words.

With the goal of counting how many customers made any given number of purchases (i.e., how many customers made 1 purchase, 2 purchases, etc.): now that we have covered the first data condensing, and you know how to count up rows in the array table, let's see how to use a loop to complete the job (a bare skeleton of the loop syntax follows). We already know how many unique customers we have and the total number of sales, so we omit that code and move on.
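Here is that bare WHILE-loop skeleton on its own, a toy sketch with a hypothetical counter (not part of the module's code):

DECLARE @i int = 1              -- loop control variable

WHILE @i <= 5                   -- exit condition: stop after five passes
BEGIN
    PRINT 'Pass number ' + CAST(@i AS varchar(2))
    SET @i = @i + 1             -- increment, or the loop never ends
END

The three required pieces are the loop control variable, the exit condition, and the increment; everything between BEGIN and END is what gets repeated.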
A tricky area is that we have to figure out how many times to loop over the table array; this type of code can blow up easily if not set up correctly.

USE [Featherman_Analytics];

DECLARE @Counter INT = 1
DECLARE @MaxOrders INT
DECLARE @NumCust INT
DECLARE @CustomerTotals TABLE ([Customer Name] nvarchar(50), [# Orders] int)

INSERT INTO @CustomerTotals ([Customer Name], [# Orders])
SELECT [Customer Name],
    COUNT(DISTINCT([Order ID])) as [# Orders]
FROM [dbo].[SuperStore_Orders]
GROUP BY [Customer Name]

SET @MaxOrders = (SELECT MAX([# Orders]) FROM @CustomerTotals)

WHILE @Counter <= @MaxOrders
BEGIN
    SET @NumCust = (SELECT COUNT(*) FROM @CustomerTotals
                    WHERE [# Orders] = @Counter)

    PRINT CAST(@NumCust AS VARCHAR(3)) + ' Customers had '
        + CAST(@Counter AS VARCHAR(2)) + ' orders'

    SET @Counter = @Counter + 1
END

PRINT 'Thats how we do it'

Please take notice of the careful use of indenting. It organizes the code: you can see, for example, where the loop begins and ends, and the actual code that is repeated.

Run the query to learn its functionality. This module demonstrates how to run a block of code over and over, here performing a SELECT COUNT and PRINTing the results. Apparently the loop (and therefore the SQL SELECT statement) runs 37 times.

The loop syntax SQL currently uses is the WHILE loop. You start the WHILE loop by specifying the exit condition, i.e., when to stop the loop (you don't want a runaway loop!). WHILE loops need a loop control variable (here @Counter) to keep count of how many times the loop has run. Notice at the bottom of the loop, before the END statement, that you have to increment the loop control variable. You place the code you want to run over and over between the BEGIN and END keywords (just as when declaring a stored procedure).

In the loop we repeatedly query the 793-row virtual table. 37 queries are run, which is the beauty of looping: you can run the same code over and over. So we calculate how many customers had 1 order, then 2 orders, and so on up to the maximum of 37 orders. Printing messages is an intermediate step; we still need to turn the results into a bar chart.

This code is very condensed and powerful, which is great, but it takes some time to decipher. When you do understand it (especially the loop), you should make a great leap forward in your understanding of programming. The genius is that the @Counter variable is incremented and used both as the loop control variable and as the WHERE clause value. The value of @Counter changes 37 times, taking the values 1 to 37, so the SQL query runs 37 times. Isn't that amazing? To learn this code quickly, print it and draw arrows connecting the declaration of each variable to its usage.
Most structured programs recommend declaring all the variables needed at the top of the program so that they are organized and kept in one place (rather than strewn about the program).

Line 3 declares @Counter, an integer variable. It is used to control the loop (the loop control variable) and is incremented on line 24 inside the loop. @Counter is also cleverly used in the PRINT statement on line 22 (more on that later).

Line 4 declares the @MaxOrders integer variable, which has its value assigned on line 14 by a simple query of the first condensed dataset stored in the array table. After its value is assigned, this variable is used as the stop value for the loop. We must always set an exit plan for the loop.

Line 5 declares the @NumCust integer, which is used inside the loop to provide the PRINT output. Its value is set on line 18.

Line 6 creates the column structure for the array table. As mentioned, lines 8-12 run the SELECT statement that loads the data into the array table.

Line 14 assigns @MaxOrders its value. The MAX function finds the largest value in the column of data, which is the highest number of orders any customer placed. Here one customer had 37 orders, so the value 37 is assigned to @MaxOrders.

Line 16 starts the loop, which in this case runs as long as @Counter is <= 37. When @Counter becomes 38, SQL jumps past the END statement.

Lines 18-19 count how many customers in the array table had @Counter orders. The first time through the loop, this counts how many customers had 1 order; @Counter becomes 2 before the query runs again, and so on.

Line 21 prints the message, for example '34 Customers had 2 orders'. Line 24 increments the loop control variable, and line 25 ends the loop.

USE [Featherman_Analytics];

DECLARE @Counter INT = 1
DECLARE @MaxOrders INT
DECLARE @NumCust INT
DECLARE @CustomerTotals TABLE ([Customer Name] nvarchar(50), [# Orders] int)
DECLARE @TotalsTable TABLE ([# Orders] int, [# Customers] int)

INSERT INTO @CustomerTotals ([Customer Name], [# Orders])
SELECT [Customer Name], COUNT(DISTINCT([Order ID]))
FROM [dbo].[SuperStore_Orders]
GROUP BY [Customer Name]

SET @MaxOrders = (SELECT MAX([# Orders]) FROM @CustomerTotals)

WHILE @Counter <= @MaxOrders
BEGIN
    SET @NumCust = (SELECT COUNT(*) FROM @CustomerTotals
                    WHERE [# Orders] = @Counter)

    INSERT INTO @TotalsTable ([# Orders], [# Customers])
    VALUES (@Counter, @NumCust)

    SET @Counter = @Counter + 1
END

SELECT * FROM @TotalsTable

This query is changed to produce a table of values that can be copied (or queried) into Excel or Tableau, etc., or perhaps stored in a permanent SQL table. This is the solution we need. A second array table is created to store the values from the 37 queries. The @TotalsTable has two columns: the first identifying how many orders, and the second identifying how many customers had that number of orders. The analysis is very useful for measuring the engagement level of different target markets.

Inside the loop we generate the values @Counter (the number of orders being researched, e.g., how many customers had 10 orders) and @NumCust (the answer to the question: the number of customers). We use an INSERT INTO statement to add a new row to the new @TotalsTable array table. This simple INSERT INTO statement is run over and over inside the loop.
This is a powerful concept that can be used to solve many different problems. To generate the actual table that can be copied into Excel, we finish with a SELECT statement that pulls the values from the array table into the SQL results. Possible improvements would be to add slicers on different target markets, a time dimension, geographic location, etc.

USE [Featherman_Analytics];

DECLARE @CustomerTotals TABLE ([Customer Name] nvarchar(50),
    [# Orders] int, Bin nvarchar(10))

INSERT INTO @CustomerTotals ([Customer Name], [# Orders], Bin)
SELECT [Customer Name],
    COUNT([Order ID]) as [# Orders], ' '
FROM [dbo].[SuperStore_Orders]
GROUP BY [Customer Name]

SELECT * FROM @CustomerTotals

UPDATE @CustomerTotals SET Bin =
CASE
    WHEN [# Orders] BETWEEN 1 and 5 then '01 to 05'
    WHEN [# Orders] BETWEEN 6 and 10 then '06 to 10'
    WHEN [# Orders] BETWEEN 11 and 15 then '11 to 15'
    WHEN [# Orders] BETWEEN 16 and 20 then '16 to 20'
    WHEN [# Orders] BETWEEN 21 and 25 then '21 to 25'
    WHEN [# Orders] BETWEEN 26 and 30 then '26 to 30'
    WHEN [# Orders] BETWEEN 31 and 35 then '31 to 35'
    WHEN [# Orders] BETWEEN 36 and 40 then '36 to 40'
    WHEN [# Orders] > 40 then 'More than 40'
    ELSE 'customer not discretized'
END

SELECT * FROM @CustomerTotals

SELECT Bin, COUNT(*) as [Count]
FROM @CustomerTotals
GROUP BY Bin

The code methodology on this and the next page is correct, but there is a small error. This professor introduces the error deliberately, to demonstrate how easy it is for analysts to have functioning but erroneous code, which can drive the wrong management decision. The first SELECT statement should read COUNT(DISTINCT([Order ID])); as written, COUNT([Order ID]) counts order line items, not the number of orders.

Often there are too many values in a column (i.e., the cardinality is too high), and a need arises to reduce the cardinality of the values to produce a report for upper management. Here we reduce the values from 37 down to 9. Putting the data into buckets, or bins, is called discretization; it is very common, and needed prior to making histograms, which typically have about 10 levels. You can see that if it were possible to reduce millions of values into 10 buckets, then data visualization and analysis would get a lot easier! Please carefully read the UPDATE command: we are setting the value of the Bin column for each row of the array table depending on the value of the [# Orders] column. The CASE statement places different values into all the rows of the Bin column.

A line-by-line analysis follows, but in short: we compile the 10,000 rows of data down to the first table shown above, fill the new Bin column using the CASE statement, and then use a simple GROUP BY query to count up the values for the bins. Fancy statistical software often provides this feature built in.

Again the goal is to gauge customer loyalty, as measured by how many times customers have purchased from the SuperStore. We want to take many values and put them into groups (bins).
This is a useful and common research technique: if you have too many bars in your chart, remember that you can collapse the data by discretizing it (putting it into buckets).

Line 3 declares the array table; this time we add a third column, which will identify which bin each customer belongs in. Lines 6-10 again insert compiled data into the array table. Notice a trick on line 8: we insert a space into the third column (Bin). This column will later receive a textual value, so it is not actually necessary to put a space into it. It is demonstrated, however, because you might have to put a 0 into a numeric column so that you can add to it later; if it were a numeric column you would use 0, not ' ', in the SELECT statement.

Lines 12 and 28 are not strictly needed and are included to facilitate learning; when you run the program you receive intermediate output.

In lines 14-26 we query the compiled data in the array table (not, as in lines 6-10, the dbo.SuperStore_Orders database table) to categorize each row and assign a value to the third column. Rather than INSERT, which is used to add new rows of data, we use UPDATE ... SET, since we are filling in values for rows that already exist. Since this is the second condensing, we have to perform it in a second query.

The classic CASE statement runs on each row of data without needing a loop. Let's think about that for a moment. SQL is set based, so the UPDATE works on the entire set of records in the array table at once. Older procedural programming would have required a loop in which each row was examined one at a time. Nice!

Lines 30-32 run a simple GROUP BY on the array table, counting up the rows by bin value (e.g., '01 to 05'). Be sure to include the leading 0 when specifying single digits so that the sort occurs as expected.

Imagine you are the junior analyst showing this not-fact-checked report to the marketing manager. Would you lose credibility? Is it easy to restore credibility after a mistake like this?

USE [Featherman_Analytics];

DECLARE @NumberOrdersForEachCustomer TABLE ([Customer Name] nvarchar(50),
    CustomerCategory nvarchar(30), [# Orders] int)

/* now that we have an array, we place aggregated values into it,
   leaving a blank in the CustomerCategory field. */

INSERT INTO @NumberOrdersForEachCustomer ([Customer Name], CustomerCategory, [# Orders])
SELECT [Customer Name], CustomerCategory = ' ', COUNT(DISTINCT([Order ID]))
FROM [dbo].[SuperStore_Orders]
GROUP BY [Customer Name]

/* now go back and use an UPDATE SET command to change the values
   in the blank CustomerCategory column. */

UPDATE @NumberOrdersForEachCustomer SET CustomerCategory =
CASE
    WHEN [# Orders] > 10 THEN 'Top Customer'
    WHEN [# Orders] BETWEEN 7 AND 10 THEN 'Good Customer'
    WHEN [# Orders] BETWEEN 4 AND 6 THEN 'Emerging Customer'
    WHEN [# Orders] BETWEEN 1 AND 3 THEN 'Bottom quartile Customer'
END

/* now show the array and a quick GROUP BY query to summarize the data */

SELECT * FROM @NumberOrdersForEachCustomer

SELECT CustomerCategory, COUNT(*) as [Count]
FROM @NumberOrdersForEachCustomer
GROUP BY CustomerCategory

This is another example of adding a new column to an array and placing a value into it that is later used to categorize and slice up the data. This demo is transitional to the next content. Again an array is created that adds a new column not found in the source data table. The column is created and blanks are placed into it using the INSERT INTO ... SELECT statements.
Then an UPDATE ... SET command with a CASE WHEN statement is used to place values into the column. It does show a clean CASE statement: values for every row of the CustomerCategory field are populated. Another INSERT INTO statement could save this data to a permanent table.

USE [Featherman_Analytics];

DECLARE @AvgNumOrdersCorrect numeric(5,2)

/* now we declare an array and place aggregated values into it,
   leaving a blank in the CustomerCategory field. */

DECLARE @NumberOrdersForEachCustomer TABLE ([Customer Name] nvarchar(50),
    CustomerCategory nvarchar(30), [# Orders] numeric(6,2))

INSERT INTO @NumberOrdersForEachCustomer ([Customer Name], CustomerCategory, [# Orders])
SELECT [Customer Name], CustomerCategory = ' ',
    COUNT(DISTINCT([Order ID]))
FROM [dbo].[SuperStore_Orders]
GROUP BY [Customer Name]

SET @AvgNumOrdersCorrect = (SELECT AVG([# Orders]) FROM @NumberOrdersForEachCustomer)

/* now go back and use an UPDATE SET command to change the values in the blank
   CustomerCategory column. Rather than know the ranges for the bins or categories
   we use a formula... you can bring in more advanced functions or calculations */

UPDATE @NumberOrdersForEachCustomer SET CustomerCategory =
CASE
    WHEN [# Orders] > @AvgNumOrdersCorrect * 1.74 THEN 'Top Customer'
    WHEN [# Orders] BETWEEN @AvgNumOrdersCorrect * 1.25 AND @AvgNumOrdersCorrect * 1.74 THEN 'Good Customer'
    WHEN [# Orders] BETWEEN @AvgNumOrdersCorrect * .75 AND @AvgNumOrdersCorrect * 1.24 THEN 'Emerging Customer'
    WHEN [# Orders] < @AvgNumOrdersCorrect * .75 THEN 'Low Frequency Customer'
END

/* now show the array and a quick GROUP BY query to summarize the data */

SELECT * FROM @NumberOrdersForEachCustomer

SELECT CustomerCategory, COUNT(*) as [Count]
FROM @NumberOrdersForEachCustomer
GROUP BY CustomerCategory

The prior example is a lot of work to maintain. For example, the range in WHEN [# Orders] BETWEEN 4 AND 6 THEN 'Emerging Customer' would need to be updated if business picked up and you wanted to recategorize the customers, or if you took this query to another target market with a different engagement level. Here we calculate the average number of business transactions and then use that metric to calculate the ranges that create the customer segments. This demonstrates that, rather than having to know the ranges for the bins or categories, you can use a formula; you can bring in more advanced functions or calculations. You could always use NTILE(), but this demo suggests that you can create your own metrics to categorize the market segments. This code also would not need updating over time.

Check the highlighted variable: if you are calculating an average, be sure to use a numeric or decimal data type to get correct results. The summary output is created in lines 29-32. Again, this approach to categorizing customers is superior in that the code will not need maintenance; strict ranges require the ranges to be updated as market conditions change. Again, line 6 is very important: do NOT use an integer data type for [# Orders] here; use a decimal or numeric type.

So again we create an array and select data into it, adding a blank column. Next we use an UPDATE ... SET command to fill the extra column. On line 10, because the order numbers are not a primary key by themselves, there are duplicates, and it is necessary to use COUNT(DISTINCT([Order ID])).
The CASE statement sets values for the CustomerCategory column; you could use additional CASE expressions to set values for other columns, but here just the CustomerCategory column is populated.

USE [Featherman_Analytics];

DECLARE @AvgNumOrdersCorrect numeric(5,2) =
    (SELECT COUNT(DISTINCT([Order ID])) * 1.0 / COUNT(DISTINCT([Customer Name]))
     FROM [dbo].[SuperStore_Orders])

DECLARE @NumberOrdersForEachCustomer TABLE ([Customer Name] nvarchar(50),
    [Customer Category] nvarchar(30), [# Orders] numeric(6,2))

INSERT INTO @NumberOrdersForEachCustomer ([Customer Name], [Customer Category], [# Orders])
SELECT [Customer Name],
[Customer Category] =
CASE
    WHEN COUNT(DISTINCT([Order ID])) > @AvgNumOrdersCorrect * 1.74 THEN 'Top Customer'
    WHEN COUNT(DISTINCT([Order ID])) BETWEEN @AvgNumOrdersCorrect * 1.25 AND @AvgNumOrdersCorrect * 1.74 THEN 'Good Customer'
    WHEN COUNT(DISTINCT([Order ID])) BETWEEN @AvgNumOrdersCorrect * .75 AND @AvgNumOrdersCorrect * 1.24 THEN 'Emerging Customer'
    WHEN COUNT(DISTINCT([Order ID])) < @AvgNumOrdersCorrect * .75 THEN 'Low Frequency Customer'
END,
COUNT(DISTINCT([Order ID])) as [# Orders]
FROM [dbo].[SuperStore_Orders]
GROUP BY [Customer Name]

SELECT * FROM @NumberOrdersForEachCustomer

SELECT [Customer Category], COUNT(*) as [Count]
FROM @NumberOrdersForEachCustomer
GROUP BY [Customer Category]

Very often you will need to break the processing into several steps, as before, where we created and loaded an array, did some processing, and then used a separate UPDATE ... SET command to fill the column with data and get the job done. This iteration demonstrates that if the job is not too complex, you can put the CASE statement directly inside the SELECT statement. Because this query does not use hard-coded ranges, and therefore is flexible to the dataset, it would be interesting to slice this data by region and/or year to see the pattern of customers. Here the 80/20 rule is not in effect; rather, the sales are spread out.

USE [Featherman_Analytics];

DECLARE @NumberCustomers AS INT =
    (SELECT COUNT(DISTINCT([Customer Name]))
     FROM [dbo].[SuperStore_Orders])

DECLARE @NumberOrdersForEachCustomer TABLE ([Customer Name] nvarchar(50), [# Orders] int)

INSERT INTO @NumberOrdersForEachCustomer
    ([Customer Name], [# Orders])
SELECT [Customer Name], COUNT(DISTINCT([Order ID]))
FROM [dbo].[SuperStore_Orders]
GROUP BY [Customer Name]

SELECT [# Orders],
    COUNT([# Orders]) AS [# Customers],
    (COUNT([# Orders]) * 1.0) / @NumberCustomers AS [% of Total],
    SUM((COUNT([# Orders]) * 1.0) / @NumberCustomers)
        OVER (ORDER BY [# Orders]
        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS [Running Total]
FROM @NumberOrdersForEachCustomer
GROUP BY [# Orders]
ORDER BY [# Orders]

The resultset was copied into Excel and then formatted nicely; this is standard procedure when you have percent values. The way to format percentages has been shown in earlier modules and is omitted here to ease learning.

See how easy it is to drive incorrect reporting? This is an important point. When providing a report for the first time, verify its accuracy with the customer while the report is 'still in development.' Here we extend the prior analysis, but we turn away from discretizing the data into buckets (again, the purpose of bucketing is to put big data into discrete buckets so that it can be analyzed more quickly and easily).

Let's provide the correct results and charts now, and introduce another methodology to produce the requested output. You can see the resulting chart is very useful, and the goal is to push the peak of the line further to the right.
Incidentally, a line chart was used rather than a bar chart, as the results seemed more dramatic. In this chart we can see that there is a large dropoff (a cliff) in purchase frequency after 7 purchases. We can also see that 10% of customers make 8 purchases, but only about 4% of customers make 10 purchases.

The next chart is easier to read. The orange columns report the same story, but the chart is a combo chart with a second axis, and the rising line is cumulative. We can now identify that 55% of the customers make 6 or fewer purchases; similarly, only roughly 11% of customers make 10 or more purchases. A good marketing campaign would be to encourage numerous small orders (triggered with promotions) to push up customer loyalty. It is well known that more loyal customers are less price conscious.

It is worth noting that this code works on 10,000 records, or 10,000,000 records, or more. SQL is very effective at consolidating huge amounts of data into small tables and charts. As developing analysts, hopefully you can intuit that SQL code is very powerful and that it is worth your time and patience to learn and leverage it. This demo adds another layer of analytics by providing a running total.

The first program had four variables. To improve organization, all variables were placed in one section, the next section assigned values to each variable, and the next section used PRINT commands to display them on screen. This input-processing-output organization is helpful and common. Here, lines 2-4 have just one variable, so it is expedient to declare the variable and assign its value in the same statement.

Line 6 again declares an array table to hold the results of the data provided by the INSERT INTO code on lines 9-14. It is important to note that after line 14, the full ~800 rows of results reside only in the memory of the SQL Server machine you are working on: you have successfully loaded data into the working memory of the database management system (DBMS). The INSERT INTO code on lines 9 and 10 works together with the SELECT ... FROM ... GROUP BY statement on lines 12-14 to receive the data.

The second condensing here doesn't need discretization, so while previously we used a WHILE loop or a CASE statement (with or without NTILE) to do the second data compression, here we do it with a second GROUP BY statement, in lines 16-25, querying the array table that holds the pre-processed data. This is industry-standard ETL work (it is prudent to google ETL if you don't recognize that acronym).

Line 17 counts the number of customers for each number of orders (e.g., 53 customers made 3 purchases), which is the GROUP BY term. We divide this count by the total number of customers (793), which was calculated at the top of the program and stored in the local variable @NumberCustomers. Notice again that because we are dividing one integer by another, we must multiply the numerator by 1.0. This column of calculations uncovers, for example, what percentage of the customers made only 1 purchase (only 1.5%, that's excellent!!).

The code in lines 20-22 adds the column that performs the running total. This column provides the data for the combo chart, which can deliver insights such as '55% of our customers make 6 or fewer purchases.' Line 20 is the same as line 18 except that it is wrapped in SUM() with an OVER() clause, so we are calculating the sum over a range that starts at the beginning of the dataset and ends with the current row.
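The windowed running-total pattern is easier to study in isolation. A minimal sketch with a hypothetical table variable @t (toy values, not the Superstore data):

DECLARE @t TABLE (x int)
INSERT INTO @t VALUES (10), (20), (30), (40)

SELECT x,
    SUM(x) OVER (ORDER BY x
        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS [Running Total]
FROM @t

Each row's Running Total is the sum of its own value and every value before it in the ORDER BY sequence: 10, 30, 60, 100.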
OK, here is some CASE statement practice.

USE [AdventureWorksDW2012];

SELECT [CustomerKey], [FirstName], [LastName], [Gender],
CASE WHEN [Gender] = 'F' THEN 'Female'
     WHEN [Gender] = 'M' THEN 'Male'
END AS [Gender],
CASE WHEN [MaritalStatus] = 'S' THEN 'Single'
     WHEN [MaritalStatus] = 'M' THEN 'Married'
END AS [Marital Status],
CASE WHEN [YearlyIncome] BETWEEN 0 and 30000 THEN 'modest'
     WHEN [YearlyIncome] BETWEEN 30000.01 and 60000 THEN 'good'
     WHEN [YearlyIncome] BETWEEN 60000.01 and 90000 THEN 'great'
     WHEN [YearlyIncome] > 90000 THEN 'top'
     ELSE 'Unknown'
END AS [Yearly Income],
[BirthDate],
CASE WHEN DATEDIFF(Year, [BirthDate], GETDATE()) BETWEEN 0 and 40 THEN 'young'
     WHEN DATEDIFF(Year, [BirthDate], GETDATE()) BETWEEN 41 and 50 THEN 'settling down'
     WHEN DATEDIFF(Year, [BirthDate], GETDATE()) BETWEEN 51 and 65 THEN 'athletic mature'
     WHEN DATEDIFF(Year, [BirthDate], GETDATE()) > 65 THEN 'mature'
END AS [Age Group],
[EnglishEducation], [EmailAddress], [AddressLine1],
CONCAT([City], ', ', [StateProvinceName], ' ', [EnglishCountryRegionName]) as [City State],
[TotalChildren], [DateFirstPurchase],
CASE WHEN [HouseOwnerFlag] = 0 then 'Renter'
     WHEN [HouseOwnerFlag] = 1 then 'Homeowner'
END AS [Home Owner],
[NumberCarsOwned]
FROM [dbo].[DimCustomer] as c
INNER JOIN [dbo].[DimGeography] as g
ON c.[GeographyKey] = g.[GeographyKey]

DBAs in the field tell me they join tables all day, trim leading and trailing spaces off primary key fields, and do other data cleansing (i.e., munging) to prepare datasets for delivery to analysts or managers who perform the analysis. Often the DBA or analyst has their own workload and, in addition, builds datasets and dishes them out to others for analysis (or directly updates the data so that dashboards and reports remain fresh and accurate). This example was inspired by DBAs in the trenches, who mentioned that they use CASE statements to turn data into meaningful descriptors; take a look at these examples. You can see that the CASE processing is very helpful.

The CASE statements are very straightforward, and you can see they give you great control over the ranges, as in the YearlyIncome field. If placing the customers into quartiles is good enough, then use the NTILE functionality; if you need more refined specification of the ranges, then use CASE statements with BETWEEN, as shown. The descriptors generated by the CASE statements will be shown in slicer options (i.e., the boss will see them), so choose the descriptors carefully.

Please run the query and examine the retrieved dataset. The result of the code is an improved dataset that is ready to be analyzed with pivot charts. Using CASE statements to append descriptors and improve retrieved data is a common pre-processing step that analysts and DBAs perform. Now you know, and more importantly, you can expand your toolkit.

Final comments:

SQL is touted as the world's most popular programming language, partly because while the middle-tier business rules (transaction processing and quality checks) and the top-level human-interface programming (i.e., tablet, phone, webpage) can be done in many different programming languages (PHP, Python, C#, etc.), the database interaction is done mainly in SQL (or MySQL).
The dominance of SQL for structured data is not likely to change, based in part on its simplicity, power, and installed base. You can also use Hadoop and other data management approaches for unstructured data (e.g., text files, data streams), but that is not the subject of this set of modules.

In any programming language there may be times when looping over data is necessary. Looping is just running the same code over and over. Looping is also commonly used in ETL processes, such as when developing SSIS packages; for example, to retrieve the data from 100 different Excel files that have been uploaded from a series of chain restaurants to a corporate dropbox. So looping remains a relevant and important programming concept. Many times, however, loops can be replaced with SQL group updates such as an UPDATE ... SET command.

WHILE loops in SQL were demonstrated, and they prove useful. The careful reader, however, will have noticed that the same data retrieval problem (two-stage compilation) was also solved using a second GROUP BY processing of an array table. There are often different ways to perform the data compressing processes needed. Loops (and the similar row-traversing SQL cursors) are considered slow, like plodding your way through the data row by row. If you can perform the processing in a set-based approach rather than looping over a dataset, your code will be many times faster; group-based operations such as SQL queries run on the entire dataset all at once, very fast (see the sketch below contrasting the two styles).

Loops in SQL are used to examine the values in each row of a data table one row at a time; cursors are related T-SQL functionality that also runs one row at a time. More than one IT manager has lamented that DBAs over-rely on loops (and cursors) rather than figure out the set-based programming logic (updating or transforming an entire set of records in big chunks, using several steps). Maybe loops are an old programmer's tool that many people are comfortable using. So yes, SQL also has loops, though only WHILE loops, not FOR EACH or FOR NEXT loops. Having said all of this, loops are one of the tools in the toolkit that get the job done.

The main concept here is that you often need two-step data shaping processes. These processes need to be thought out and tested before being used in the production of reports. Keep learning!!
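As a closing illustration, here is a minimal sketch of the same categorization done both ways: row by row with a WHILE loop, and then as one set-based UPDATE. The @scores table variable and its column names are hypothetical, not from the module's datasets:

DECLARE @scores TABLE (id int IDENTITY(1,1), score int, grade nvarchar(10))
INSERT INTO @scores (score) VALUES (55), (72), (90)

-- Row-by-row style: one UPDATE statement per pass through the loop
DECLARE @i int = 1
DECLARE @max int = (SELECT MAX(id) FROM @scores)
WHILE @i <= @max
BEGIN
    UPDATE @scores SET grade = CASE WHEN score >= 70 THEN 'pass' ELSE 'fail' END
    WHERE id = @i
    SET @i = @i + 1
END

-- Set-based style: one UPDATE touches every row at once
UPDATE @scores SET grade = CASE WHEN score >= 70 THEN 'pass' ELSE 'fail' END

SELECT * FROM @scores

Both versions produce the same grades; the set-based UPDATE simply lets the database engine process the whole set in one statement instead of issuing one statement per row, which is why it scales so much better.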