This program was written shortly after I downloaded LiveWebStats (which can be found at http://www.chaosreigns.com/stats) and discovered it didn't quite do what I needed, but had a number of good ideas on how I could write something that did.
What did I need? Well, first of all, I wanted to be able to add hit counters to pages without relying on additional CGI, cute odometer images, etc. All the hits are being logged into Apache's access_log, so why add additional mechanisms to count hits already being counted? All that I really needed was something to scan my log files and tell me how many times a particular page had been hit. If it produced some nice reports along the way, that would be nice too.
LiveWebStats generates reports. Unfortunately, it writes them to HTML files. What I really wanted was something that wrote out snippets that I could then <!--#include virtual="..." --> into my pages. I hacked on LiveWebStats for a while, changing the output to .shtml and trying to fix the tables (it generates "sloppy" tables -- <TD>'s with no </TD>'s, for example, which is fine for top-level tables but fails miserably if the table is within another table, which unfortunately is how my entire website is set up), then decided to just start from scratch. I did, but as I went along I frequently found myself saying "Now how did LiveWebStats do that?" and pulling up that code to study it while working on my own. Thus, although technically written from scratch, my code borrows quite a bit from LiveWebStats. My thanks to Darxus for the excellent ideas and code...
Alright, here's what it does. It starts up and scans your httpd logs (common or combined format), compiles a bunch of statistics, then writes out a bunch of files suitable for being included in your web pages. The files come in two varieties, tables and snips.
The tables are your basic reports on all the various statistics; see my website for examples. It should be noted that they aren't complete tables: they contain only rows. This is because how you format your tables is likely to be different from how I format mine, depending on the look and feel of your website. Thus, you create your own <TABLE> header, then just include the generated file as the body, something like this:
<TABLE ALIGN=CENTER BORDER=4 BGCOLOR="#DDDDDD">
<TR><TH ALIGN=LEFT>Cool web surfing program</TH><TH ALIGN=RIGHT>Hits</TH></TR>
<!--#include virtual="agent.table" -->
</TABLE>
So YOU create the table, with whatever formatting, colors, etc. you want, and just use SSI to include the content.
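To make the "rows only" idea concrete, here is a minimal sketch in Python (not logjack's own Perl code) of how a row file such as agent.table could be produced. The function name table_rows, the cell attributes, and the sample data are all hypothetical; the only points taken from the text above are that the output holds rows without a <TABLE> wrapper and that every <TD> gets its closing </TD> so the rows survive being nested inside another table.

```python
from html import escape


def table_rows(counts):
    """Return HTML table rows (no <TABLE> wrapper), busiest entry first."""
    rows = []
    # Sort by descending hit count, then by name so ties come out in a stable order.
    for name, hits in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        rows.append(
            "<TR><TD ALIGN=LEFT>%s</TD><TD ALIGN=RIGHT>%d</TD></TR>"
            % (escape(name), hits)
        )
    return "\n".join(rows) + "\n"


# Hypothetical sample data, not real statistics.
sample = {"Mozilla/4.0": 12, "Lynx/2.8": 3}
print(table_rows(sample), end="")
```

Escaping the name matters here because user-agent strings come straight from visitors and may contain characters like < or &.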
Of course, these tables are a side benefit; what I really wanted was access counts. These are generated and stored in a fileinfo directory, where they can be included like this:
<P>Qtarot: download the source <A HREF="qtarot.tar.gz">here</A>!<BR> [Downloaded <!--#include virtual="/fileinfo/=2Fqtarot.tar.gz" -->.]</P>
Which looks ugly, but the viewer sees something like this:
Qtarot: download the source here!
[Downloaded 58 times since 2001-04-19 14:08:22.]
The fileinfo directory has a snip like that for every file you haven't excluded from the statistics counting. Lines from the log can be excluded based on IP address, file, or anything you can parse out with a regex. On my site I exclude all image files from the statistics, as I don't really care how many times people have downloaded my navbar (well, it's not really a bar, let's just call it a navigation gadget).
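The counting and excluding described above might look roughly like the following Python sketch. It is not logjack's actual code: the regular expression is one plausible way to pick apart a common or combined format line, the two ignore lists are made-up examples (image extensions and a private IP range), and counting every request regardless of status code is an assumption.

```python
import re
from collections import Counter

# Leading fields shared by the common and combined log formats.
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical exclusion patterns, for illustration only.
IGNORE_FILES = [re.compile(r"\.(gif|jpe?g|png)$", re.IGNORECASE)]
IGNORE_LINES = [re.compile(r"^192\.168\.")]


def count_hits(lines):
    """Count requests per path, skipping excluded lines and excluded files."""
    hits = Counter()
    for line in lines:
        if any(p.search(line) for p in IGNORE_LINES):
            continue  # whole log line excluded, e.g. by IP address
        m = LOG_RE.match(line)
        if m is None:
            continue  # not a line we know how to parse
        path = m.group("path")
        if any(p.search(path) for p in IGNORE_FILES):
            continue  # file excluded from the statistics
        hits[path] += 1
    return hits


# Made-up log lines for demonstration.
sample = [
    '10.0.0.1 - - [19/Apr/2001:14:08:22 -0400] "GET /qtarot.tar.gz HTTP/1.0" 200 51234 "-" "Lynx/2.8"',
    '10.0.0.2 - - [19/Apr/2001:14:09:01 -0400] "GET /navgadget.gif HTTP/1.0" 200 812 "-" "Lynx/2.8"',
    '192.168.1.5 - - [19/Apr/2001:14:10:45 -0400] "GET /qtarot.tar.gz HTTP/1.0" 200 51234 "-" "Lynx/2.8"',
]
print(dict(count_hits(sample)))
```

Only the first sample line is counted: the second is an image and the third comes from an excluded address.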
Somewhere in your Apache config, you'll have a line specifying where to record each access (usually in a file creatively called "access_log"). LEAVE IT ALONE! You don't want to replace it; you want to add another one alongside it. Apache lets you log to as many different places as you like. Nifty, eh? So add something like this:
CustomLog "|exec /usr/local/bin/logjack.pl /usr/local/etc/logjack" combined
The first parameter to CustomLog tells Apache where to send its log information. By starting it with a pipe, we say we want the following program executed and the data piped to its STDIN. The "exec" is optional but prevents an extra copy of /bin/sh from hanging around in memory all the time. Next is the path to where you placed logjack.pl (could be anywhere you like), and finally, logjack.pl takes as its one and only parameter the location of its configuration directory (which again can be anywhere). The last CustomLog parameter is the log file format. Just say "combined".
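For anyone curious what a piped logger looks like from the program's side, here is a minimal Python sketch of the general pattern: read lines from a stream until Apache closes the pipe, and write reports no more often than a set interval. This is an assumption about the shape of such a program, not a description of logjack.pl's internals. The names run, write_stats, and clock are hypothetical, and the 300-second default simply mirrors the five-minute figure given in the Q&A further down.

```python
import time


def run(stream, write_stats, interval=300.0, clock=time.monotonic):
    """Consume log lines until EOF, calling write_stats at most once per interval."""
    lines_seen = 0
    last_write = clock()
    for line in stream:
        lines_seen += 1  # a real program would parse the line and update its statistics here
        now = clock()
        if now - last_write >= interval:
            write_stats(lines_seen)
            last_write = now
    write_stats(lines_seen)  # final write once the pipe is closed
    return lines_seen


# Under Apache this would be called with sys.stdin as the stream (after "import sys").
```

Note that this sketch only checks the clock when a new line arrives, so a site with no traffic writes nothing, which is harmless because nothing has changed.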
Now, the configuration directory should contain three files: "config.pl", "files.ignore", and "log.ignore". The first contains variables you can set to control where the output is written, what format the snips should be in, what reports to generate, etc. "files.ignore" specifies which files you don't want statistics on: usually image files, but it can be anything you don't want hit counts on or don't want to see in your reports. "log.ignore" specifies which log lines you don't want to see. Usually you just want to put in regular expressions that match the IP addresses of your own machines and perhaps of robots that visit you, although potentially you could filter just about anything with this file.
One of the things config.pl specifies is the output directory. This directory should exist, and it should be somewhere accessible from the web so that its tables and snips can be read by your SSI. Inside that directory there must also exist a directory called "fileinfo" where individual file hit counts will be stored.
A note on the individual file hit counts. They're stored in files whose names are based on, but not identical to, the original file's name. Basically, it's the file's URI, with all characters other than alphanumerics, periods, hyphens, and underscores converted to =XX (equals followed by two hex digits). If that doesn't explain it, just run the darn thing and look at the files in the "fileinfo" dir. You'll see. These are the files you want to include for page hit counts, download counts, etc.
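If you'd rather work out a snip's file name than go looking for it, the naming rule above can be sketched in a few lines of Python. This is a reimplementation of the rule as described, not code lifted from logjack.pl. The function name snip_name is hypothetical, uppercase hex digits are assumed from the "=2F" in the earlier qtarot example, and encoding non-ASCII characters byte by byte as UTF-8 is purely a guess.

```python
import re

# Anything other than alphanumerics, period, underscore, or hyphen gets encoded.
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]")


def snip_name(uri):
    """Turn a URI into a fileinfo file name using =XX escapes."""
    def encode(match):
        # Assumption: uppercase hex, one =XX group per UTF-8 byte.
        return "".join("=%02X" % b for b in match.group(0).encode("utf-8"))
    return _UNSAFE.sub(encode, uri)


print(snip_name("/qtarot.tar.gz"))
```

For the download example earlier in this file, this gives the same "=2Fqtarot.tar.gz" name used in the include directive.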
Q: How "live" are the stats?
A: You specify that in config.pl. By default, it'll wait up to five minutes before writing new reports, but if your machine is relatively quick or your statistics aren't terribly big, you might want to shorten that interval so reports are written more often. On my own site, which doesn't have a great many pages to keep track of, I'm never more than two minutes behind.
Q: Can you run it as a CRON job?
A: Theoretically. In fact, add "exit 0;" after the first "writestats" call in logjack.pl and it'll be perfectly suitable for that. But why would you want to? Each time it runs it would have to reanalyze all your logfiles, whereas if you run it through CustomLog it keeps that information up to date all the time at virtually no cost CPU-wise. Thus, there will never be a "CRON job" option in the program -- if you want that, hack it into the code yourself.
Q: Why doesn't it generate pretty bar graphs like LiveWebStats?
A: Just 'cuz. :) I must admit I didn't pay too much attention to the generating of reports. They're there because it's easy to do once you've parsed the logs, but my main objective was to get automatic hit counts for all my pages. I may add a feature like that in the future, provided it doesn't chew up too many cycles (my webserver is a SPARCstation IPC, so I'm not big on "heavy" scripts). One thing you won't see is image generation on the fly...
Q: You really need to be using SSI or JSP or something like that to take advantage of this program. Doesn't that stress the webserver? Wouldn't you rather use static HTML pages?
A: There are, in fact, no straight .html pages on my website; everything is done with .shtml, so this program is designed to work with that. If you want straight .html pages generated, LiveWebStats already does an excellent job of this, so use it! As far as stressing the webserver goes, my webserver is a ten-year-old, 25 MHz computer (SPARCstation IPC) and it doesn't seem to have any problems with my ludicrous overuse of SSI. It sure takes the work out of making all my pages match the visual theme of the site.
Q: Does it work under non-Unix operating systems?
A: I have no idea...