Thursday, June 10, 2010

Log Analysis Commentary

I've hade some requests to explain some of the less common functions used in my log analysis screencast. I think the most straightforward approach is to examine each of the lines in a literate Haskell style. This is going to be a long-winded description of exactly what's going on. If you understood everything in the screencast, this post will probably bore you. But if you found yourself wondering what the heck was going on, this post might help.
> :m + Data.List Data.Function
> contents <- readFile "user.log"
> let l = lines contents
> let t = map words l
> mapM print $ take 2 t
These four lines are pretty straightforward. ":m +" is GHCi syntax that is similar to an import. readFile :: FilePath -> IO String reads the contents of a file into a string. The lines function splits the string on newlines and creates a list of strings representing each line in the file. We map the words function over each of these lines to split the lines around whitespace. At this point t :: [[String]]. You can think of it as a table (hence the name 't') where each row is a line in the file and each column is a field. The "mapM print" displays the first two elements of t on separate lines.
> let noDay = map (\(d:ds) -> take 7 d : ds) t
> let months = groupBy ((==) `on` head) noDay
Now we get into the meat of the analysis. The noDay line uses a simple map and a lambda to strip off the last three characters of the first field in every line, turning the field into unique month identifier. "groupBy" is a handy function that groups a list into "partitions" where the elements in a partition are all equal for a user-specified definition of equality. In this case, we're grouping the rows in noDay and we want to use equality of the first field to define our groups. The 'on' function is a handy little tool defined in Data.Function that makes this easier.
on :: (b -> b -> c) -> (a -> b) -> a -> a -> c
On's first argument is a binary operator "b -> b -> c". It's second argument is a function that transforms a's into b's. It returns a new binary operator "a -> a -> c" that applies the transform function to the a's to get two b's that it can use with the original binary operator. In our example, the "a -> a -> c" is equivalent to "row -> row -> Bool" (straight out of the definition of groupBy). So the 'on' function helps us construct this row comparator by first transforming the row and then comparing those things. Our comparison function is (==), and our row transformation is "head", which gets us the month field.
Here's a simple example:
> let exampleList = [ (1,9), (1,7), (2,16), (2,6) ]
> groupBy ((==) `on` fst) exampleList
This groupBy call returns [ [(1,9),(1,7)], [(2,16),(2,6)] ]. It has grouped all the consecutive tuples with 1 as the first element into one list and all the 2's into a second list. These lists then must be grouped with a surrounding list. In the original example, we get a list of groups by months. The result can be conceptualized as list of bins representing each month where each of those bins is a list of all the log entries that happened in that month.
> let monthUniqs = map (nubBy ((==) `on` (!!2))) months
Our next line has the form "map ... months". This means that we're doing some operation on each of the "month bins" we just created. In this case our operation is "nubBy ((==) `on` (!!2))". It's very similar to the groupBy line. 'nubBy' removes duplicates from a list, where the supplied comparison function defines what things are duplicates.
> nubBy (==) [1,1,2,2,1] == [1,2]
We again call on the trusty 'on' function to make nubBy use the third field (the username) to determine equality. This removes all duplicate usernames from each of the month bins, so the number of items in each of the bins is the number of unique registered users that came to the site in that month.
> zip (map (head . head) monthUniqs) (map length monthUniqs)
Now we want to display the length of each of the bins. The lengths are more interesting when we know which months they go with, so we use the zip function to combine two lists into one list of tuples.
> let user = groupBy ((==) `on` (!!2)) $ sortBy (compare `on` (!!2)) t
By now these patterns should be looking familiar. Here we're grouping by the username field just like we grouped by the month field before. The only difference is that we have to sort the list by the username field first because groupBy only groups equivalent elements that are adjacent. The result of this is a list of bins representing each user.
> length users
The length of this list tells us the number of registered users that have logged in.
> let userDays = map (nubBy ((==) `on` head)) user
Now we're nubbing the user bins to remove duplicate days. (It's days because user was created from t instead of noDay.)
> let visitCounts = map length userDays
This tells us how many different days each user has visited the site.

None of what we have done here is particularly difficult. It wouldn't be hard to do the same thing with Ruby or Python. The point of the screencast is to show that it can also be done easily in Haskell, a statically typed, compiled language; and to demonstrate some useful functions in Haskell's standard library.

1 comment:

elzurk said...

Nice add to the video :)