Calculating Percentages

When working with groups and aggregate functions, it is often needed to represent an aggregate value as a percentage, rather than as an absolute value.
We might want, for example, to calculate the number of hits per page type from a Web server log as a percentage relative to the total number of hits, rather than as the absolute number itself.

Consider the previous example query, that calculates the count of hits per requested page type:

SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, COUNT(*) 
FROM ex040528.log 
GROUP BY PageType
PageType COUNT(ALL *)
-------- ------------
htm	115
css	22
gif	585
exe	25
nsf	142
swf	11
jpg	77
html	 1
dll	1
asp	5
js	 11
class	5
If we wanted to calculate the percentage of hits for each group, we would need to divide the number of hits within each group by the total number of hits in the whole log file; however, the use of the GROUP BY clause restricts each aggregate function to operate within the single groups, thus making it impossible to calculate at the same time the total number of hits across all groups.

To workaround this problem, we use two special aggregate functions available in the Log Parser SQL language: PROPCOUNT and PROPSUM.
When used in their basic forms, these functions calculate the ratio of the COUNT or ADD aggregate functions within a group to the COUNT or ADD aggregate functions on all of the input records.

Using the PROPCOUNT function, we can change the query above as follows:

SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, PROPCOUNT(*) 
FROM ex040528.log 
GROUP BY PageType
And obtain:
PageType PROPCOUNT(ALL *)
-------- ----------------
htm	0.115000
css	0.022000
gif	0.585000
exe	0.025000
nsf	0.142000
swf	0.011000
jpg	0.077000
html	 0.001000
dll	0.001000
asp	0.005000
js	 0.011000
class	0.005000
To show real percentages, we can multiply the aggregate function values by 100:
SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, MUL( PROPCOUNT(*), 100.0 ) AS PageTypeHits
FROM ex040528.log 
GROUP BY PageType
PageType PageTypeHits
-------- ------------
htm	11.500000
css	2.200000
gif	58.500000
exe	2.500000
nsf	14.200000
swf	1.100000
jpg	7.700000
html	 0.100000
dll	0.100000
asp	0.500000
js	 1.100000
class	0.500000
From the results of this query we can infer that, for example, requests to "css" pages represent the 2.2% of the total number of requests in this log file.

Calculating Percentages Across Multiple Group Hierarchies

The examples above show the basic form of the PROPCOUNT and PROPSUM functions, which calculates the percentage of an aggregate function within a group relative to all of the input records.
However, it is also possible to use the PROPCOUNT and PROPSUM functions to calculate percentages relative to hierarchically higher groups.
To do so, we can use the ON keyword after the PROPCOUNT or PROPSUM function name followed by a list of the GROUP BY field-expressions identifying which hierarchically higher group we want the percentage to be relative to.

Consider one of the previous examples, in which we calculated the total number of hits per page type per HTTP status code, modified to show percentages rather than absolute numbers:

SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status, MUL(PROPCOUNT(*), 100.0) AS Hits
FROM ex040528.log 
GROUP BY PageType, sc-status
ORDER BY PageType, sc-status
PageType sc-status Hits
-------- --------- ------------
asp	200	 0.500000
class	200	 0.100000
class	304	 0.400000
css	200	 0.300000
css	304	 1.000000
css	404	 0.900000
dll	500	 0.100000
exe	200	 2.500000
gif	200	 12.300000
gif	304	 45.000000
gif	404	 1.200000
htm	200	 3.400000
htm	304	 7.900000
htm	404	 0.200000
html	 404	 0.100000
jpg	200	 1.700000
jpg	304	 6.000000
js	 200	 0.400000
js	 304	 0.700000
nsf	200	 12.900000
nsf	302	 0.100000
nsf	304	 0.900000
nsf	403	 0.300000
swf	200	 0.300000
swf	304	 0.800000
The "Hits" field shows the percentage of hits for a page type and HTTP status code relative to the total number of hits.

If we wanted to calculate the percentage of hits for a page type and HTTP status code relative to the number of hits for that page type (i.e. the distribution of HTTP status codes within each page type), we would have written the query as follows:

SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status, MUL(PROPCOUNT(*) ON (PageType), 100.0) AS Hits
FROM ex040528.log 
GROUP BY PageType, sc-status
ORDER BY PageType, sc-status
The output would be:
PageType sc-status Hits
-------- --------- ----------
asp	200	 100.000000
class	200	 20.000000
class	304	 80.000000
css	200	 13.636364
css	304	 45.454545
css	404	 40.909091
dll	500	 100.000000
exe	200	 100.000000
gif	200	 21.025641
gif	304	 76.923077
gif	404	 2.051282
htm	200	 29.565217
htm	304	 68.695652
htm	404	 1.739130
html	 404	 100.000000
jpg	200	 22.077922
jpg	304	 77.922078
js	 200	 36.363636
js	 304	 63.636364
nsf	200	 90.845070
nsf	302	 0.704225
nsf	304	 6.338028
nsf	403	 2.112676
swf	200	 27.272727
swf	304	 72.727273
We can now infer that, for example, about 45% of requests to "css" pages returned an HTTP status code of 304.

Here we have used the ON keyword followed by the "PageType" GROUP BY field-expression. This notation indicates that we want the PROPCOUNT function to calculate the ratio of the COUNT aggregate function within a single group to the COUNT aggregate function within the hierarchically higher group identified by the "PageType" field-expression.

As another example, we can modify the previous example query to create groups based on the time the request was made at (quantized at 20-second intervals), the page type, and the HTTP status code:

SELECT QUANTIZE(time, 20) AS Interval, EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status
FROM ex040528.log 
GROUP BY Interval, PageType, sc-status
ORDER BY Interval, PageType, sc-status
For each group, we can calculate the percentage of hits relative to the number of hits within the time interval and page type, the percentage of hits relative to the number of hits within the time interval alone, and the percentage of hits relative to the total number of hits:
SELECT QUANTIZE(time, 20) AS Interval, EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status,
	 MUL( PROPCOUNT(*) ON (Interval, PageType), 100.0) AS Hits1,
	 MUL( PROPCOUNT(*) ON (Interval), 100.0) AS Hits2,
	 MUL( PROPCOUNT(*), 100.0) AS Hits3 
FROM ex040528.log 
GROUP BY Interval, PageType, sc-status
ORDER BY Interval, PageType, sc-status
Interval PageType sc-status Hits1	Hits2	 Hits3
-------- -------- --------- ---------- --------- ---------
00:28:40 css	200	 20.000000  1.470588  0.100000
00:28:40 css	304	 60.000000  4.411765  0.300000
00:28:40 css	404	 20.000000  1.470588  0.100000
00:28:40 exe	200	 100.000000 7.352941  0.500000
00:28:40 gif	200	 10.000000  1.470588  0.100000
00:28:40 gif	304	 70.000000  10.294118 0.700000
00:28:40 gif	404	 20.000000  2.941176  0.200000
00:28:40 htm	200	 11.764706  2.941176  0.200000
00:28:40 htm	304	 88.235294  22.058824 1.500000
00:28:40 jpg	200	 25.000000  1.470588  0.100000
00:28:40 jpg	304	 75.000000  4.411765  0.300000
00:28:40 nsf	200	 100.000000 35.294118 2.400000
00:28:40 swf	200	 33.333333  1.470588  0.100000
00:28:40 swf	304	 66.666667  2.941176  0.200000
00:29:00 ASP	200	 100.000000 0.216920  0.100000
00:29:00 GIF	200	 100.000000 0.433839  0.200000
00:29:00 asp	200	 100.000000 0.216920  0.100000
00:29:00 class	200	 50.000000  0.216920  0.100000
00:29:00 class	304	 50.000000  0.216920  0.100000
00:29:00 css	200	 14.285714  0.216920  0.100000
00:29:00 css	304	 28.571429  0.433839  0.200000
00:29:00 css	404	 57.142857  0.867679  0.400000
00:29:00 dll	500	 100.000000 0.216920  0.100000
00:29:00 exe	200	 100.000000 1.952278  0.900000
00:29:00 gif	200	 21.794872  14.750542 6.800000
00:29:00 gif	304	 76.923077  52.060738 24.000000
00:29:00 gif	404	 1.282051   0.867679  0.400000
00:29:00 htm	200	 34.090909  3.253796  1.500000
00:29:00 htm	304	 63.636364  6.073753  2.800000
00:29:00 htm	404	 2.272727   0.216920  0.100000
00:29:00 html	 404	 100.000000 0.216920  0.100000
00:29:00 jpg	200	 35.000000  1.518438  0.700000
00:29:00 jpg	304	 65.000000  2.819957  1.300000
00:29:00 js	 200	 50.000000  0.433839  0.200000
00:29:00 js	 304	 50.000000  0.433839  0.200000
00:29:00 nsf	200	 94.339623  10.845987 5.000000
00:29:00 nsf	403	 5.660377   0.650759  0.300000
00:29:00 swf	200	 50.000000  0.433839  0.200000
00:29:00 swf	304	 50.000000  0.433839  0.200000
00:29:20 NSF	200	 100.000000 2.127660  0.300000
00:29:20 asp	200	 100.000000 0.709220  0.100000
00:29:20 class	304	 100.000000 0.709220  0.100000
00:29:20 css	304	 60.000000  2.127660  0.300000
00:29:20 css	404	 40.000000  1.418440  0.200000
00:29:20 exe	200	 100.000000 2.836879  0.400000
00:29:20 gif	304	 97.142857  48.226950 6.800000
00:29:20 gif	404	 2.857143   1.418440  0.200000
00:29:20 htm	200	 15.789474  2.127660  0.300000
00:29:20 htm	304	 78.947368  10.638298 1.500000
00:29:20 htm	404	 5.263158   0.709220  0.100000
00:29:20 jpg	200	 15.384615  1.418440  0.200000
00:29:20 jpg	304	 84.615385  7.801418  1.100000
00:29:20 js	 200	 50.000000  1.418440  0.200000
00:29:20 js	 304	 50.000000  1.418440  0.200000
00:29:20 nsf	200	 61.111111  7.801418  1.100000
00:29:20 nsf	302	 5.555556   0.709220  0.100000
00:29:20 nsf	304	 33.333333  4.255319  0.600000
00:29:20 swf	304	 100.000000 2.127660  0.300000
From the query results we can infer, for example, that during the "00:29:20" time interval, about 78% of the requests to "htm" pages returned the HTTP status code 304.
In the same time interval, requests to "htm" pages returning the HTTP status code 304 made up for about 10% of the requests, and these requests represent the 1.5% of the total number of requests in the log.

The example above shows that a PROPCOUNT or PROPSUM function with no ON keyword is logically equivalent to using the ON keyword followed by an empty list of GROUP BY field-expressions, meaning that the percentage to be calculated should be relative to the highest hierarchical group identified by no field-expression, i.e. the whole set of input records.

In addition, it is also worth mentioning that the list of field-expressions specified after the ON keyword must be a proper prefix of the GROUP BY field-expressions. If, for example, the ON keyword is followed by three field-expressions, then these three field-expressions must match the first three field-expressions in the GROUP BY clause, and they must also appear in the same order as they do in the GROUP BY clause.
In other words, each PROPCOUNT function in the following query is correct, since the lists of field-expressions after the ON keyword are all a proper prefix of the GROUP BY field-expressions:

SELECT QUANTIZE(time, 20) AS Interval, EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status,
	 MUL( PROPCOUNT(*) ON (Interval, PageType), 100.0) AS Hits1,
	 MUL( PROPCOUNT(*) ON (Interval), 100.0) AS Hits2
FROM ex040528.log 
GROUP BY Interval, PageType, sc-status
However, none of the PROPCOUNT functions in the following query is correct, since the lists of field-expressions after the ON keyword are not a proper prefix of the GROUP BY field-expressions:
SELECT QUANTIZE(time, 20) AS Interval, EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status,
	 MUL( PROPCOUNT(*) ON (PageType, sc-status), 100.0) AS Hits1,
	 MUL( PROPCOUNT(*) ON (PageType), 100.0) AS Hits2,
	 MUL( PROPCOUNT(*) ON (Interval, sc-status), 100.0) AS Hits2,
FROM ex040528.log 
GROUP BY Interval, PageType, sc-status

© 2004 Microsoft Corporation. All rights reserved.