Monday, May 2, 2011

Aggregating Across Repeated Columns: Averaging

In the first post in this series (The Problem of Repeated Columns), I defined repeated columns and talked about the data integrity problems associated with them. If you haven't read that yet, it would be worthwhile to read first. Similarly, if you are not familiar with the concept of Normalization, you should read my blog series What is Normalization?

So far in this series, I've discussed the problem with querying textual or Boolean (Yes/No) data from repeated columns. But numeric data offers new challenges because we often want to do math on them. The most common kind of math is aggregation, that is, summing, counting, and averaging numeric values. This time, I'll talk about averaging data in repeated columns.

Others in this series:

Averaging Across Columns

The most common type of data aggregation, perhaps, is in calculating an Average where you divide the sum of the values by the count of the values.

In Excel, there is an AVERAGE function which will average the cells that have a value. The Excel function will work for any range of cells, across or down. In Access, however, the Average() function only works down columns, not across columns. So, averaging values in repeated rows is easy (see below), but just like summing and counting, averaging across repeated columns is more challenging.

For instance, suppose I had a table of student test scores:




If I wanted to average the test values for each student, I need to create an expression that sums the values for the numerator and counts the values for the denominator. For details on how to sum the values, see Aggregating Across Repeated Columns: Summing. To count the values, see Aggregating Across Repeated Columns: Counting

To do this, I need 4 stages.

1. Numerator: Sum the Values

Nz([Test1])+Nz([Test2])+Nz([Test3])+Nz([Test4]) AS TestSum

2. Denominator: Count the Values

Abs((Not IsNull([Test1])) + (Not IsNull([Test2])) + (Not IsNull([Test3])) + (Not IsNull([Test4])))

3. Handle Denominator of Zero

If the denominator is 0 (zero) the calculation will return the #Num! error. Therefore, I have to test the denominator for zero and convert it to a NULL. Dividing any value with a NULL will return NULL.

IIf((Abs((Not IsNull([Test1]))+(Not IsNull([Test2]))+(Not IsNull([Test3]))+(Not IsNull([Test4]))))=0,Null,(Abs((Not IsNull([Test1]))+(Not IsNull([Test2]))+(Not IsNull([Test3]))+(Not IsNull([Test4]))))

4. Average: Divide Numerator (Sum) by Denominator (Count)

(Nz([Test1])+Nz([Test2])+Nz([Test3])+Nz([Test4]))/IIf((Abs((Not IsNull([Test1]))+(Not IsNull([Test2]))+(Not IsNull([Test3]))+(Not IsNull([Test4]))))=0,Null,(Abs((Not IsNull([Test1]))+(Not IsNull([Test2]))+(Not IsNull([Test3]))+(Not IsNull([Test4])))))AS TestAverage

The full query would look like this:

SELECT Student, (Nz([Test1])+Nz([Test2])+Nz([Test3])+Nz([Test4]))/IIf((Abs((Not IsNull([Test1]))+(Not IsNull([Test2]))+(Not IsNull([Test3]))+(Not IsNull([Test4]))))=0,Null,(Abs((Not IsNull([Test1]))+(Not IsNull([Test2]))+(Not IsNull([Test3]))+(Not IsNull([Test4]))))) AS TestAverage
FROM StudentScores_RepeatedColumns
ORDER BY Student;


Or in the Query Builder:



The result would look like this:



Averaging Down Rows

By contrast, suppose I normalize the table to remove the repeated columns. The table should look something like this:



Since the table is normalized (that is, the values go down a row), I can use the aggregate (or "Totals") functions that are built in to SQL. In this case, it's the Avg() function.

Avg(Score) AS TestAverage

The full query would look like this:

SELECT StudentID, Avg(Score) AS TestAverage
FROM StudentScores_Rows
GROUP BY StudentID
ORDER BY StudentID;


Or in the Query Builder:



Once again, the results of the queries are identical:




Now, with only 4 test scores, the expression to average repeated columns is manageable. But what if there were 20 or 50? The expression quickly becomes long and cumbersome. But with the normalized structure, the query doesn't change no matter how many test values there are.

Alternate Solution: User Defined Function (UDF)

Although the best solution is to normalize your database, it is often not practical with an existing database. In that case, a User Defined Function (UDF) may be a solution.

Fellow MVP, John Spencer has created a UDF for averaging any number of columns:

Public Function fRowAverage(ParamArray Values())
'John Spencer
'Last Update: April 5, 2000
'Calculates the arithmetic average (mean) of a group of values passed to it.
'Sample call:
'myAvg = fRowAverage("1","TEST","2", "3",4,5,6,0) returns 3 (21/7)
'Ignores values that cannot be treated as numbers.
'
' Max of 29 arguments can be passed to a function in Access SQL
Dim i As Integer, intElementCount As Integer, dblSum As Double
  intElementCount = 0
  dblSum = 0
  For i = LBound(Values) To UBound(Values)
    If IsNumeric(Values(i)) Then 'Ignore Non-numeric values
      dblSum = dblSum + Values(i)
      intElementCount = intElementCount + 1
    End If
  Next I 

  If intElementCount > 0 Then 'At least one number in the group of values
    fRowAverage = dblSum / intElementCount
  Else 'No number in the group of values
    fRowAverage = Null
  End If

End Function

The function should go in a General Module. Then it can be used in a query as follows:

SELECT 
   Student, 
   fRowAverage([Test1],[test2],[Test3],[Test4]) AS TestAverage
FROM StudentScores_RepeatedColumns
ORDER BY Student;








No comments: