Thursday, June 10, 2010

Log Analysis Commentary

I've had some requests to explain some of the less common functions used in my log analysis screencast. I think the most straightforward approach is to examine each of the lines in a literate Haskell style. This is going to be a long-winded description of exactly what's going on. If you understood everything in the screencast, this post will probably bore you. But if you found yourself wondering what the heck was going on, this post might help.
> :m + Data.List Data.Function
> contents <- readFile "user.log"
> let l = lines contents
> let t = map words l
> mapM print $ take 2 t
These five lines are pretty straightforward. ":m +" is GHCi syntax that is similar to an import. readFile :: FilePath -> IO String reads the contents of a file into a string. The lines function splits the string on newlines, producing a list of strings with one entry per line in the file. We map the words function over each of these lines to split them on whitespace. At this point t :: [[String]]. You can think of it as a table (hence the name 't') where each row is a line in the file and each column is a field. The "mapM print" displays the first two elements of t on separate lines.
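As a small illustration of the lines/words step, here is a sketch that runs on an in-memory string instead of a file. The sample data and the field layout (date, time, username) are assumptions made for this example; the actual format of user.log is not shown here. The name toTable is also hypothetical.

```haskell
-- Hypothetical sample standing in for the contents of user.log.
-- Assumed layout: date, time, username, separated by spaces.
sampleContents :: String
sampleContents = "2010-06-01 12:00:01 alice\n2010-06-01 12:05:42 bob\n"

-- Split the text into lines, then split each line on whitespace.
toTable :: String -> [[String]]
toTable = map words . lines
```

Loaded into GHCi, toTable sampleContents evaluates to [["2010-06-01","12:00:01","alice"],["2010-06-01","12:05:42","bob"]].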
> let noDay = map (\(d:ds) -> take 7 d : ds) t
> let months = groupBy ((==) `on` head) noDay
Now we get into the meat of the analysis. The noDay line uses a simple map and a lambda to strip off the last three characters of the first field in every line, turning the field into a unique month identifier. "groupBy" is a handy function that groups a list into "partitions" where the elements in a partition are all equal for a user-specified definition of equality. In this case, we're grouping the rows in noDay and we want to use equality of the first field to define our groups. The 'on' function is a handy little tool defined in Data.Function that makes this easier.
on :: (b -> b -> c) -> (a -> b) -> a -> a -> c
The first argument of 'on' is a binary operator "b -> b -> c". Its second argument is a function that transforms a's into b's. It returns a new binary operator "a -> a -> c" that applies the transform function to the a's to get two b's that it can use with the original binary operator. In our example, the "a -> a -> c" is equivalent to "row -> row -> Bool" (straight out of the definition of groupBy). So the 'on' function helps us construct this row comparator by first transforming each row and then comparing the results. Our comparison function is (==), and our row transformation is "head", which gets us the month field.
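To make the type concrete, the sketch below writes the row comparator twice: once with 'on' and once expanded by hand. The Row alias and the names sameMonth and sameMonth' are hypothetical and exist only for this illustration.

```haskell
import Data.Function (on)

-- A row is one log line split into fields (alias assumed for this sketch).
type Row = [String]

-- The comparator as built with 'on': compare rows by their first field.
sameMonth :: Row -> Row -> Bool
sameMonth = (==) `on` head

-- The same comparator with 'on' expanded by hand.
sameMonth' :: Row -> Row -> Bool
sameMonth' x y = head x == head y
```

Both functions should agree on every pair of non-empty rows. For instance, sameMonth ["2010-06","x"] ["2010-06","y"] is True, while sameMonth ["2010-05","x"] ["2010-06","x"] is False.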
Here's a simple example:
> let exampleList = [ (1,9), (1,7), (2,16), (2,6) ]
> groupBy ((==) `on` fst) exampleList
This groupBy call returns [ [(1,9),(1,7)], [(2,16),(2,6)] ]. It has grouped all the consecutive tuples with 1 as the first element into one list and all the 2's into a second list. These lists are then collected in an outer list. In the original example, we get a list of groups by month. The result can be conceptualized as a list of bins, one per month, where each bin is a list of all the log entries that happened in that month.
> let monthUniqs = map (nubBy ((==) `on` (!!2))) months
Our next line has the form "map ... months". This means that we're doing some operation on each of the "month bins" we just created. In this case our operation is "nubBy ((==) `on` (!!2))". It's very similar to the groupBy line. 'nubBy' removes duplicates from a list, where the supplied comparison function defines what things are duplicates.
> nubBy (==) [1,1,2,2,1] == [1,2]
We again call on the trusty 'on' function to make nubBy use the third field (the username) to determine equality. This removes all duplicate usernames from each of the month bins, so the number of items in each of the bins is the number of unique registered users that came to the site in that month.
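Here is a minimal sketch of the same nubBy call applied to one month bin. The rows are invented for illustration and assume the username is the third field; juneBin and uniqueUsers are hypothetical names.

```haskell
import Data.Function (on)
import Data.List (nubBy)

-- Hypothetical rows for a single month bin: month, time, username.
juneBin :: [[String]]
juneBin =
  [ ["2010-06", "12:00:01", "alice"]
  , ["2010-06", "12:05:42", "bob"]
  , ["2010-06", "18:30:10", "alice"]
  ]

-- Keep only the first row seen for each username (field index 2).
uniqueUsers :: [[String]] -> [[String]]
uniqueUsers = nubBy ((==) `on` (!! 2))
```

With this data, uniqueUsers juneBin keeps the first two rows and drops alice's second entry, so length (uniqueUsers juneBin) is 2.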
> zip (map (head . head) monthUniqs) (map length monthUniqs)
Now we want to display the length of each of the bins. The lengths are more interesting when we know which months they go with, so we use the zip function to combine two lists into one list of tuples.
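The zip step can be sketched on a tiny, made-up set of month bins. The data and the names sampleMonthUniqs and monthCounts are assumptions for this example, not output from the real log.

```haskell
-- Hypothetical month bins after duplicates were removed:
-- one user in May, two users in June.
sampleMonthUniqs :: [[[String]]]
sampleMonthUniqs =
  [ [ ["2010-05", "09:00:00", "alice"] ]
  , [ ["2010-06", "12:00:01", "alice"]
    , ["2010-06", "12:05:42", "bob"] ]
  ]

-- Pair each bin's month label (first field of its first row)
-- with the number of rows in that bin.
monthCounts :: [[[String]]] -> [(String, Int)]
monthCounts bins = zip (map (head . head) bins) (map length bins)
```

Here monthCounts sampleMonthUniqs evaluates to [("2010-05",1),("2010-06",2)]. The use of head is safe because groupBy never produces an empty group.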
> let user = groupBy ((==) `on` (!!2)) $ sortBy (compare `on` (!!2)) t
By now these patterns should be looking familiar. Here we're grouping by the username field just like we grouped by the month field before. The only difference is that we have to sort the list by the username field first because groupBy only groups equivalent elements that are adjacent. The result of this is a list of bins representing each user.
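The effect of sorting first is easy to see on a small list. This sketch uses invented (username, value) pairs rather than real log rows; the names visits, unsortedGroups, and sortedGroups are hypothetical.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortBy)

-- Hypothetical data: bob appears twice, but not in adjacent positions.
visits :: [(String, Int)]
visits = [("bob", 1), ("alice", 2), ("bob", 3)]

-- Without sorting, bob's two entries land in separate groups.
unsortedGroups :: [[(String, Int)]]
unsortedGroups = groupBy ((==) `on` fst) visits

-- Sorting by the key first makes equal keys adjacent,
-- so groupBy collects them into a single group.
sortedGroups :: [[(String, Int)]]
sortedGroups = groupBy ((==) `on` fst) (sortBy (compare `on` fst) visits)
```

unsortedGroups is [[("bob",1)],[("alice",2)],[("bob",3)]], while sortedGroups is [[("alice",2)],[("bob",1),("bob",3)]].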
> length user
The length of this list tells us the number of registered users that have logged in.
> let userDays = map (nubBy ((==) `on` head)) user
Now we're nubbing the user bins to remove duplicate days. (It's days because user was created from t instead of noDay.)
> let visitCounts = map length userDays
This tells us how many different days each user has visited the site.
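As a rough sketch, the last three steps can be combined into one function and tried on sample data. The rows below are invented and assume the layout date, time, username; sampleTable and visitDaysPerUser are hypothetical names that do not appear in the screencast.

```haskell
import Data.Function (on)
import Data.List (groupBy, nubBy, sortBy)

-- Hypothetical log table: date, time, username.
sampleTable :: [[String]]
sampleTable =
  [ ["2010-06-01", "12:00:01", "alice"]
  , ["2010-06-01", "12:05:42", "bob"]
  , ["2010-06-01", "18:30:10", "alice"]
  , ["2010-06-02", "08:15:00", "alice"]
  ]

-- For each user, count the distinct days on which they appear.
visitDaysPerUser :: [[String]] -> [(String, Int)]
visitDaysPerUser t = map summarize userDays
  where
    -- Sort by username so that groupBy sees each user's rows together.
    user     = groupBy ((==) `on` (!! 2)) (sortBy (compare `on` (!! 2)) t)
    -- Within each user bin, keep one row per day (first field).
    userDays = map (nubBy ((==) `on` head)) user
    -- Label each bin with its username; bins are never empty.
    summarize bin = (head bin !! 2, length bin)
```

With this sample, visitDaysPerUser sampleTable evaluates to [("alice",2),("bob",1)]: alice has three rows but only two distinct days.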

None of what we have done here is particularly difficult. It wouldn't be hard to do the same thing with Ruby or Python. The point of the screencast is to show that it can also be done easily in Haskell, a statically typed, compiled language, and to demonstrate some useful functions in Haskell's standard library.

Tuesday, June 8, 2010

Heist 0.2 Released

Yesterday I released version 0.2 of the Heist XML templating library. This release makes some significant API changes, so it may break existing applications. But if you're only using Heist for basic templating, then you probably won't have to change anything. For example, http://snapframework.com uses Heist extensively, and it did not require any code changes to upgrade. Here is a summary of what changed in 0.2:
  • String substitution in attributes
  • Support for DOCTYPE in templates
  • New implementation for TemplateMonad
  • Windows support
  • Typeable instance
  • Documentation improvements
  • Bug fixes

The biggest new feature in 0.2 is attribute string substitution. Because Heist's templating mechanism uses XML tags, it didn't automatically work for string substitution inside tag attributes. You don't have to do very much web development before you find yourself wanting this. Now you can get substitution in attributes using the syntax "$(name)". The Heist Tutorial describes this in more detail.

Previously Heist did not allow DOCTYPE declarations inside its templates. This was caused by implementation details, but needed to be fixed. Now you can put DOCTYPEs in your templates. Heist removes the DOCTYPE and copies it into the rendered page. If it encounters more than one DOCTYPE while recursing through your templates, it will use the first one.

Thanks to help from Edward Kmett we have a complete rewrite of TemplateMonad internals in this release. Previously we were using RWST because Heist uses Reader and State functionality under the hood. The rewrite eliminates the unnecessary Writer overhead. In the process, we decided to make it more convenient to use TemplateMonad as a monad transformer. TemplateMonad now provides instances of the most common monad type classes, eliminating the need to use "lift" to access standard monad functionality in the inner monad. This will make it easier to use Heist with more complex monad stacks.

In addition to these improvements, our users also contributed Windows compatibility and a Typeable instance. The Typeable instance makes it possible to use Heist dynamically with the Hint runtime Haskell interpreter. We love getting community contributions and hope to see more of them in the future.

For more information about Heist and the Snap Web Framework see http://snapframework.com.

Monday, June 7, 2010

Haskell Scripting: Log Analysis

The other day I wanted to analyze some of my website log files to get a better idea of how many active users I have. I've been meaning to do this for quite some time, but have kept putting it off. I decided to see what I could accomplish by just doing some experimenting in GHCi. It was so easy and convenient that I decided to do a screencast demonstrating what I did and how easy it was. The conciseness of Haskell combined with the instant feedback of an interpreter makes a very powerful combination. Here's the screencast on Vimeo.