Aggregating Data Within Groups

All the query examples that we have seen so far share a common characteristic: the values of each output record were built upon the values of a single input record.
Sometimes, however, we might need to aggregate multiple input records together and perform some operation on groups of input records.
To accomplish this task, the Log Parser SQL-Like language has a special set of functions that can be used to perform basic calculations on multiple records. These aggregate functions (also referred to as "SQL functions") include SUM, COUNT, MAX, MIN, and AVG.

Aggregating Data

To show a classic example of the use of aggregate functions, assume that given an IIS W3C log file, we want to calculate the total number of bytes sent by the IIS server during the whole period recorded in the log file.
Considering that the number of bytes sent by the IIS server for each HTTP request is logged in the "sc-bytes" field, our command will look like the following example:

C:\>LogParser -i:IISW3C -o:NAT "SELECT SUM(sc-bytes) FROM ex040528.log"
Since the SELECT clause of this query makes use of the SUM aggregate function, the query will automatically aggregate all the input records, and calculate the sum of all the values of the "sc-bytes" field across all the input records; the output of this command will then look like the following output:
SUM(sc-bytes)
-------------
242834732
As the example shows, the result of the query is a single output record, containing a single value calculated across all the input records.

As another example, we might want to calculate how many requests have been logged in the log file.
Considering that each log file entry represents a single HTTP request, this task can be accomplished by simply counting how many input records are logged in the file:

C:\>LogParser -i:IISW3C -o:NAT "SELECT COUNT(*) FROM ex040528.log"
The example above makes use of the COUNT aggregate function. When used with the special "*" argument, the COUNT function returns the total number of input records processed by the query.

If we want to calculate how many requests satisfy a particular condition, for example how many requests were for an ASP page, we can add a WHERE clause to the query, and the COUNT function will only count input records satisfying the WHERE condition:

SELECT COUNT(*) FROM ex040528.log WHERE EXTRACT_EXTENSION(cs-uri-stem) LIKE 'asp'

Creating Groups

In the examples above, we have been using aggregate functions to calculate a value across all the input records; sometimes, however, we might want to calculate values across groups of input records.

As an example, we might want to calculate the total number of bytes sent by the IIS server for each URL. To perform this task, we need to divide all the input records into groups according to the URL requested, and then use the SUM aggregate function separately on each group.

This can be accomplished by using another building block of the Log Parser SQL language: the GROUP BY clause.
The GROUP BY clause is used to specify which fields we want the group subdivision to be based on; after the input records have been divided into these groups, all the aggregate functions in the SELECT clause will be calculated separately on each of these groups, and the query will return an output record for each group created.

Using the GROUP BY clause, our example query and its output will look like this:

SELECT cs-uri-stem, COUNT(*) FROM ex040528.log GROUP BY cs-uri-stem
cs-uri-stem			COUNT(*)
---------------------- --------
/Home/default.asp	5
/Home/images/bckgd.gif 419
/Docs/expl.htm		 12
/Docs/main.htm		 26
/login/frmx.dll		1

To make another example, assume that we want to calculate how many requests have been served for each page type (ASP, html, CSS, etc.).
First of all, we need to create separate groups according to the extension of the URL; after this group subdivision has been done, we can calculate a COUNT(*) on each group:

SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, COUNT(*) 
FROM ex040528.log 
GROUP BY PageType
The output will look like:
PageType COUNT(ALL *)
-------- ------------
htm	115
css	22
gif	585
exe	25
nsf	142
swf	11
jpg	77
html	 1
dll	1
asp	5
js	 11
class	5

If we sort the output above according to the number of requests for each group, we will be creating a list showing the most requested page types first:

SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, COUNT(*) AS PageTypeHits
FROM ex040528.log 
GROUP BY PageType
ORDER BY PageTypeHits DESC
The output will look like:
PageType PageTypeHits
-------- ------------
gif	585
nsf	142
htm	115
jpg	77
exe	25
css	22
js	 11
swf	11
asp	5
class	5
dll	1
html	 1

Groups can also be built on multiple fields, thus creating a hierarchy of groups.

For example, consider the following query:

SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status, COUNT(*) 
FROM ex040528.log 
GROUP BY PageType, sc-status
This query creates groups according to the requested page type, and within each of these groups, sub-groups are created according to the HTTP status sent by the IIS server for the group page type; the aggregate function "COUNT" will then be calculated on each sub-group.
The output will look like:
PageType sc-status PageTypeHits
-------- --------- ------------
htm	304	 79
css	304	 10
gif	304	 450
exe	200	 25
nsf	200	 129
swf	200	 3
gif	404	 12
css	404	 9
htm	200	 34
css	200	 3
jpg	200	 17
gif	200	 123
jpg	304	 60
swf	304	 8
nsf	403	 3
html	 404	 1
dll	500	 1
asp	200	 5
js	 304	 7
class	304	 4
js	 200	 4
htm	404	 2
class	200	 1
nsf	304	 9
nsf	302	 1

It's important to note a particular language constraint derived from the use of the GROUP BY clause.
Whenever a query contains a GROUP BY clause, its SELECT clause can only contain any of the following:

In other words, the following example is a correct query:
SELECT 'hello', TO_UPPERCASE(cs-uri-stem), COUNT(*), SUM(sc-bytes)
FROM ex040528.log 
GROUP BY cs-uri-stem
In fact, the SELECT clause in the example above contains:
However, the following example is NOT a correct query:
SELECT date, COUNT(*), SUM(sc-bytes)
FROM ex040528.log 
GROUP BY cs-uri-stem
The SELECT clause in the example above contains a field-expression ("date") that does not appear in the GROUP BY clause.

The following example is also NOT a correct query:
SELECT TO_UPPERCASE(cs-uri-stem), COUNT(*), SUM(sc-bytes)
FROM ex040528.log 
GROUP BY SUBSTR(TO_UPPERCASE(cs-uri-stem), 0, 5)
The SELECT clause in the example above contains a field-expression ("TO_UPPERCASE(cs-uri-stem)") that is not derived from any field-expression in the GROUP BY clause; in this case, it's actually the field-expression in the GROUP BY clause that is derived from a field-expression in the SELECT clause.
The previous example can be corrected as follows:
SELECT SUBSTR(TO_UPPERCASE(cs-uri-stem), 0, 5), COUNT(*), SUM(sc-bytes)
FROM ex040528.log 
GROUP BY SUBSTR(TO_UPPERCASE(cs-uri-stem), 0, 5)

© 2004 Microsoft Corporation. All rights reserved.