When working with groups and aggregate functions, it is often
needed to represent an aggregate value as a percentage, rather than
as an absolute value.
We might want, for example, to calculate the number of hits per
page type from a Web server log as a percentage relative to the
total number of hits, rather than as the absolute number
itself.
Consider the previous example query, that calculates the count
of hits per requested page type:
SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, COUNT(*) FROM ex040528.log GROUP BY PageType
PageType COUNT(ALL *) -------- ------------ htm 115 css 22 gif 585 exe 25 nsf 142 swf 11 jpg 77 html 1 dll 1 asp 5 js 11 class 5If we wanted to calculate the percentage of hits for each group, we would need to divide the number of hits within each group by the total number of hits in the whole log file; however, the use of the GROUP BY clause restricts each aggregate function to operate within the single groups, thus making it impossible to calculate at the same time the total number of hits across all groups.
To workaround this problem, we use two special aggregate
functions available in the Log Parser SQL language:
PROPCOUNT and PROPSUM.
When used in their basic forms, these functions calculate the ratio
of the COUNT or ADD aggregate functions within a group to the COUNT
or ADD aggregate functions on all of the input records.
Using the PROPCOUNT function, we can change the query above as
follows:
SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, PROPCOUNT(*) FROM ex040528.log GROUP BY PageTypeAnd obtain:
PageType PROPCOUNT(ALL *) -------- ---------------- htm 0.115000 css 0.022000 gif 0.585000 exe 0.025000 nsf 0.142000 swf 0.011000 jpg 0.077000 html 0.001000 dll 0.001000 asp 0.005000 js 0.011000 class 0.005000To show real percentages, we can multiply the aggregate function values by 100:
SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, MUL( PROPCOUNT(*), 100.0 ) AS PageTypeHits FROM ex040528.log GROUP BY PageType
PageType PageTypeHits -------- ------------ htm 11.500000 css 2.200000 gif 58.500000 exe 2.500000 nsf 14.200000 swf 1.100000 jpg 7.700000 html 0.100000 dll 0.100000 asp 0.500000 js 1.100000 class 0.500000From the results of this query we can infer that, for example, requests to "css" pages represent the 2.2% of the total number of requests in this log file.
The examples above show the basic form of the PROPCOUNT and
PROPSUM functions, which calculates the percentage of an aggregate
function within a group relative to all of the input
records.
However, it is also possible to use the PROPCOUNT and PROPSUM
functions to calculate percentages relative to hierarchically
higher groups.
To do so, we can use the ON keyword after the PROPCOUNT or
PROPSUM function name followed by a list of the GROUP BY
field-expressions identifying which hierarchically higher group we
want the percentage to be relative to.
Consider one of the previous examples, in which we calculated
the total number of hits per page type per HTTP status code,
modified to show percentages rather than absolute numbers:
SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status, MUL(PROPCOUNT(*), 100.0) AS Hits FROM ex040528.log GROUP BY PageType, sc-status ORDER BY PageType, sc-status
PageType sc-status Hits -------- --------- ------------ asp 200 0.500000 class 200 0.100000 class 304 0.400000 css 200 0.300000 css 304 1.000000 css 404 0.900000 dll 500 0.100000 exe 200 2.500000 gif 200 12.300000 gif 304 45.000000 gif 404 1.200000 htm 200 3.400000 htm 304 7.900000 htm 404 0.200000 html 404 0.100000 jpg 200 1.700000 jpg 304 6.000000 js 200 0.400000 js 304 0.700000 nsf 200 12.900000 nsf 302 0.100000 nsf 304 0.900000 nsf 403 0.300000 swf 200 0.300000 swf 304 0.800000The "Hits" field shows the percentage of hits for a page type and HTTP status code relative to the total number of hits.
If we wanted to calculate the percentage of hits for a page type
and HTTP status code relative to the number of hits for that page
type (i.e. the distribution of HTTP status codes within each page
type), we would have written the query as follows:
SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status, MUL(PROPCOUNT(*) ON (PageType), 100.0) AS Hits FROM ex040528.log GROUP BY PageType, sc-status ORDER BY PageType, sc-statusThe output would be:
PageType sc-status Hits -------- --------- ---------- asp 200 100.000000 class 200 20.000000 class 304 80.000000 css 200 13.636364 css 304 45.454545 css 404 40.909091 dll 500 100.000000 exe 200 100.000000 gif 200 21.025641 gif 304 76.923077 gif 404 2.051282 htm 200 29.565217 htm 304 68.695652 htm 404 1.739130 html 404 100.000000 jpg 200 22.077922 jpg 304 77.922078 js 200 36.363636 js 304 63.636364 nsf 200 90.845070 nsf 302 0.704225 nsf 304 6.338028 nsf 403 2.112676 swf 200 27.272727 swf 304 72.727273We can now infer that, for example, about 45% of requests to "css" pages returned an HTTP status code of 304.
Here we have used the ON keyword followed by the "PageType"
GROUP BY field-expression. This notation indicates that we want the
PROPCOUNT function to calculate the ratio of the COUNT aggregate
function within a single group to the COUNT aggregate function
within the hierarchically higher group identified by the "PageType"
field-expression.
As another example, we can modify the previous example query to
create groups based on the time the request was made at (quantized at 20-second intervals), the
page type, and the HTTP status code:
SELECT QUANTIZE(time, 20) AS Interval, EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status FROM ex040528.log GROUP BY Interval, PageType, sc-status ORDER BY Interval, PageType, sc-statusFor each group, we can calculate the percentage of hits relative to the number of hits within the time interval and page type, the percentage of hits relative to the number of hits within the time interval alone, and the percentage of hits relative to the total number of hits:
SELECT QUANTIZE(time, 20) AS Interval, EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status, MUL( PROPCOUNT(*) ON (Interval, PageType), 100.0) AS Hits1, MUL( PROPCOUNT(*) ON (Interval), 100.0) AS Hits2, MUL( PROPCOUNT(*), 100.0) AS Hits3 FROM ex040528.log GROUP BY Interval, PageType, sc-status ORDER BY Interval, PageType, sc-status
Interval PageType sc-status Hits1 Hits2 Hits3 -------- -------- --------- ---------- --------- --------- 00:28:40 css 200 20.000000 1.470588 0.100000 00:28:40 css 304 60.000000 4.411765 0.300000 00:28:40 css 404 20.000000 1.470588 0.100000 00:28:40 exe 200 100.000000 7.352941 0.500000 00:28:40 gif 200 10.000000 1.470588 0.100000 00:28:40 gif 304 70.000000 10.294118 0.700000 00:28:40 gif 404 20.000000 2.941176 0.200000 00:28:40 htm 200 11.764706 2.941176 0.200000 00:28:40 htm 304 88.235294 22.058824 1.500000 00:28:40 jpg 200 25.000000 1.470588 0.100000 00:28:40 jpg 304 75.000000 4.411765 0.300000 00:28:40 nsf 200 100.000000 35.294118 2.400000 00:28:40 swf 200 33.333333 1.470588 0.100000 00:28:40 swf 304 66.666667 2.941176 0.200000 00:29:00 ASP 200 100.000000 0.216920 0.100000 00:29:00 GIF 200 100.000000 0.433839 0.200000 00:29:00 asp 200 100.000000 0.216920 0.100000 00:29:00 class 200 50.000000 0.216920 0.100000 00:29:00 class 304 50.000000 0.216920 0.100000 00:29:00 css 200 14.285714 0.216920 0.100000 00:29:00 css 304 28.571429 0.433839 0.200000 00:29:00 css 404 57.142857 0.867679 0.400000 00:29:00 dll 500 100.000000 0.216920 0.100000 00:29:00 exe 200 100.000000 1.952278 0.900000 00:29:00 gif 200 21.794872 14.750542 6.800000 00:29:00 gif 304 76.923077 52.060738 24.000000 00:29:00 gif 404 1.282051 0.867679 0.400000 00:29:00 htm 200 34.090909 3.253796 1.500000 00:29:00 htm 304 63.636364 6.073753 2.800000 00:29:00 htm 404 2.272727 0.216920 0.100000 00:29:00 html 404 100.000000 0.216920 0.100000 00:29:00 jpg 200 35.000000 1.518438 0.700000 00:29:00 jpg 304 65.000000 2.819957 1.300000 00:29:00 js 200 50.000000 0.433839 0.200000 00:29:00 js 304 50.000000 0.433839 0.200000 00:29:00 nsf 200 94.339623 10.845987 5.000000 00:29:00 nsf 403 5.660377 0.650759 0.300000 00:29:00 swf 200 50.000000 0.433839 0.200000 00:29:00 swf 304 50.000000 0.433839 0.200000 00:29:20 NSF 200 100.000000 2.127660 0.300000 00:29:20 asp 200 100.000000 0.709220 0.100000 00:29:20 class 304 100.000000 0.709220 0.100000 00:29:20 css 304 60.000000 2.127660 0.300000 00:29:20 css 404 40.000000 1.418440 0.200000 00:29:20 exe 200 100.000000 2.836879 0.400000 00:29:20 gif 304 97.142857 48.226950 6.800000 00:29:20 gif 404 2.857143 1.418440 0.200000 00:29:20 htm 200 15.789474 2.127660 0.300000 00:29:20 htm 304 78.947368 10.638298 1.500000 00:29:20 htm 404 5.263158 0.709220 0.100000 00:29:20 jpg 200 15.384615 1.418440 0.200000 00:29:20 jpg 304 84.615385 7.801418 1.100000 00:29:20 js 200 50.000000 1.418440 0.200000 00:29:20 js 304 50.000000 1.418440 0.200000 00:29:20 nsf 200 61.111111 7.801418 1.100000 00:29:20 nsf 302 5.555556 0.709220 0.100000 00:29:20 nsf 304 33.333333 4.255319 0.600000 00:29:20 swf 304 100.000000 2.127660 0.300000From the query results we can infer, for example, that during the "00:29:20" time interval, about 78% of the requests to "htm" pages returned the HTTP status code 304.
The example above shows that a PROPCOUNT or PROPSUM function with no ON keyword is logically equivalent to using the ON keyword followed by an empty list of GROUP BY field-expressions, meaning that the percentage to be calculated should be relative to the highest hierarchical group identified by no field-expression, i.e. the whole set of input records.
In addition, it is also worth mentioning that the list of
field-expressions specified after the ON keyword must be a
proper prefix of the GROUP BY field-expressions. If, for
example, the ON keyword is followed by three field-expressions,
then these three field-expressions must match the first three
field-expressions in the GROUP BY clause, and they must also appear
in the same order as they do in the GROUP BY clause.
In other words, each PROPCOUNT function in the following query is
correct, since the lists of field-expressions after the ON keyword
are all a proper prefix of the GROUP BY field-expressions:
SELECT QUANTIZE(time, 20) AS Interval, EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status, MUL( PROPCOUNT(*) ON (Interval, PageType), 100.0) AS Hits1, MUL( PROPCOUNT(*) ON (Interval), 100.0) AS Hits2 FROM ex040528.log GROUP BY Interval, PageType, sc-statusHowever, none of the PROPCOUNT functions in the following query is correct, since the lists of field-expressions after the ON keyword are not a proper prefix of the GROUP BY field-expressions:
SELECT QUANTIZE(time, 20) AS Interval, EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status, MUL( PROPCOUNT(*) ON (PageType, sc-status), 100.0) AS Hits1, MUL( PROPCOUNT(*) ON (PageType), 100.0) AS Hits2, MUL( PROPCOUNT(*) ON (Interval, sc-status), 100.0) AS Hits2, FROM ex040528.log GROUP BY Interval, PageType, sc-status
© 2004 Microsoft Corporation. All rights reserved.