So I'm writing a log parser. We finally realized that without some knowledge of who's coming from where on our site, we have no way of making decisions regarding marketing.
There are two ways I could do this. The first is to build the tracking & reporting into the current platform. The problem is, this platform is supposed to be replaced soon, at which point the whole tracking system becomes obsolete. It also adds processing to every page request, slowing down the site. The other option is to parse the log files. I liked this option because it's portable and it doesn't interfere with the processing of the page itself.
Turns out the log files are big, 200-300 MB each. It takes quite some time to process files that large, but that's OK, I have time. Over the weekend, the parser read two weeks of logs. As the files were parsed, the database grew. As the database grew, the whole process began to grind to a halt: querying the database for previous sessions and information about the user was taking up 100% of the CPU.
The log files contain all the requests the server handled, including requests for images, CSS files and such that I'm not interested in for analysis. As I read the file line by line, I first make sure the line is one I'm interested in before processing and storing the data. This saves a lot of processing, because most of the requests are for images.
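A minimal sketch of that filtering step, assuming an Apache-style access log where the request line appears in quotes; the extension list and the regex are stand-ins for whatever the real parser uses:

```python
import re

# Static-asset extensions we never analyze (an assumption, not the real list).
SKIP_EXTENSIONS = ('.gif', '.jpg', '.jpeg', '.png', '.css', '.js', '.ico')
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+)')

def is_interesting(line):
    """Return True only for page requests worth parsing and storing."""
    match = REQUEST_RE.search(line)
    if not match:
        return False
    path = match.group(1).split('?', 1)[0].lower()
    return not path.endswith(SKIP_EXTENSIONS)

def parse_log(path, process):
    """Read the log line by line, handing only interesting lines to `process`."""
    with open(path) as log:
        for line in log:
            if is_interesting(line):
                process(line)  # most lines (images etc.) never get this far
```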
At this point I hit a problem. The website sets a persistent cookie that the log parser uses to identify users. This works fine, except for first-time visitors. A first-time visitor will (obviously) not be sending his cookie to the server, so he can't get logged. The server will set a cookie that gets sent along on subsequent requests, but responses aren't logged either. So the first hit will be orphaned from subsequent hits.
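In other words, the identification step looks roughly like this sketch, where the cookie name ("uid"), the log format and the table layout are all assumptions:

```python
import re
import sqlite3  # stand-in engine so the sketch runs; the real database isn't named here

UID_RE = re.compile(r'uid=([0-9A-Fa-f]+)')

def record_hit(conn, line, ip, user_agent):
    match = UID_RE.search(line)
    cookie_id = match.group(1) if match else None  # None on an orphaned first hit
    # The orphaned first hit is stored with no cookie ID; it can only be tied
    # back to the visitor later.
    conn.execute(
        "INSERT INTO visits (cookie_id, ip_address, user_agent) VALUES (?, ?, ?)",
        (cookie_id, ip, user_agent),
    )
```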
The second hit, for non-bots, will be for an image or a CSS file, and that request will be logged along with his cookie. I don't have a perfect way to match these two requests up, but I figure IP address and user-agent should be reliable enough. So I can no longer just ignore all those image requests; instead I run an update query that sets the cookie ID on the user record where the IP address and user-agent match.
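A sketch of that back-fill, again with assumed table and column names: when an image or CSS request arrives carrying a cookie, attach its cookie ID to any earlier orphaned hit from the same IP address and user-agent.

```python
def backfill_cookie(conn, cookie_id, ip, user_agent):
    # Attach the cookie to the orphaned first hit(s) left by record_hit above.
    conn.execute(
        """
        UPDATE visits
           SET cookie_id = ?
         WHERE cookie_id IS NULL
           AND ip_address = ?
           AND user_agent = ?
        """,
        (cookie_id, ip, user_agent),
    )
    conn.commit()
```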
That is a lot of update queries. And these queries (obviously) cannot be cached. About 95% of them do nothing, because there is no matching record to update. I haven't had the opportunity to test this yet, but I think that first checking for a matching record would speed things up significantly. Why? Because the select query can be cached, so 95% of the time there would be almost no database lookup at all.
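The check-then-update version would look something like the sketch below, reusing backfill_cookie from above. The caching argument assumes a database engine with a query cache (MySQL's, for example); the sqlite3 stand-in here doesn't have one, so this is only the shape of the idea.

```python
def backfill_cookie_if_needed(conn, cookie_id, ip, user_agent):
    # Cheap, cacheable check first: is there actually an orphaned hit to fix?
    row = conn.execute(
        "SELECT 1 FROM visits"
        " WHERE cookie_id IS NULL AND ip_address = ? AND user_agent = ? LIMIT 1",
        (ip, user_agent),
    ).fetchone()
    if row is None:
        return  # ~95% of the time: nothing to update, so no write is issued
    backfill_cookie(conn, cookie_id, ip, user_agent)  # from the sketch above
```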