Self-Hosted Event Tracking with Nginx and Lua

Update: I've written up a simpler solution that you might want to take a look at. I don't recommend the approach below, though it's useful for understanding Lua in Nginx.

Over the years I've become less and less trusting of third-party network requests on the websites I visit. In part, it's due to the ever-escalating hoarding and selling of our personal data by ad-tech companies, something I've witnessed first-hand while working in the industry.

However, there are legitimate use cases for tracking on the public web to better understand your users and improve your product. In fact, I've come to the conclusion that it really is the only way to get accurate feedback. The vast majority of users will never tell you how they use your website, and the ones that do will likely skip over certain details.

So, to me, the problem with tracking isn't the tracking itself but how the data is managed. And the easiest way for me to make sure data isn't being misused is to host it myself.

Requirements

It's worth mentioning why I need event tracking and what my requirements are.

Why Not Use Request Logs?

One of the simplest ways to get event data is to look at request logs, and doing just that would fulfil my requirements.

However, in my case, my web server sits behind Cloudflare's CDN. This means Cloudflare receives most of the requests and only contacts my web server when it needs to refresh its cache.

Removing the CDN is not an option as it reduces a lot of my bandwidth costs and server load. And, as far as I know, Cloudflare's free tier does not provide network logs.

The only solution is to have a separate request sent directly to my server with similar details. This can be done either by using a separate domain or, to avoid cross-origin request issues, by disabling CDN caching using Cache-Control headers.

While the latter does mean the CDN is handling every request and likely logging it, that's already the case with most of the website's content. Removing the CDN also introduces other issues such as exposing the web server to direct malicious attacks.

Existing Solutions

There are plenty of self-hosted event tracking services that provide similar features to third-party solutions like Google Analytics. Matomo (formerly Piwik) is probably the most popular of the bunch.

At the end of the day, all these web analytics services can be broken down into three steps:

  1. Send. A client sends events to a server.
  2. Store. The server processes and stores events.
  3. Query. The server provides an interface to query events.

Pretty much every solution differentiates itself on its querying capabilities. So much so that Matomo, while mostly open source, places its more advanced features behind a paywall.

While these services satisfy my basic requirements, they also do a lot more, and as such, I lose a lot of control and have to maintain more than I'm actually using.

My Solution

I already have an Nginx server compiled with a Lua module (via OpenResty), so ideally, a simple handler to log my events to disk will be enough. To simplify event processing, I can log my events as JSON, then query and aggregate those logs using jq. The server itself is not very powerful, so anything heavier, like Node.js, isn't possible.

1. Sending Events

Tracking has been a core part of the web for a while. So much so that web browsers have built-in mechanisms to send tracking events.

Anchor tags (<a>) have the ping attribute, which sends a POST request to a list of URLs when the link is followed. However, it only applies to anchor tags, so it won't work for buttons and other interactive elements.
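As a rough sketch, assuming a hypothetical nav link and reusing the /_event endpoint described later:

<a href="/about" ping="/_event?source=nav">About</a>

When the link is followed, the browser sends a POST request to each URL listed in ping.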

The Beacon API also sends a POST request with a custom payload. It's specifically made for event tracking, so browsers can optimise for it. In a perfect world, I'd use it, but the API doesn't work on certain browsers like Safari 10, so it isn't universal.

There's also the Fetch API which lets you send any request, and with the keepalive flag enabled, it's similar to the Beacon API but with more flexibility.
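For comparison, here's a minimal sketch of those two approaches, assuming the same /_event endpoint used below:

const trackWithBeacon = (payload) => {
  if (navigator.sendBeacon) {
    // The browser queues the request and tries to deliver it even if
    // the page is being unloaded.
    navigator.sendBeacon(`${window.location.origin}/_event?${payload}`);
  } else {
    // Fallback: fetch with keepalive behaves similarly.
    fetch(`${window.location.origin}/_event?${payload}`, { keepalive: true });
  }
};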

One of the most popular ways is to use the <img> tag programmatically to send a GET request. I'll be using this approach as it's common and lightweight. However, it's possible to use any of the methods above as, at the end of the day, they all do the same thing: send a request. Chances are I'll switch to the Beacon API at some point.

const track = (payload) => {
  // Creating an <img> and setting its src is enough to fire the
  // request; the element never needs to be added to the page.
  const img = document.createElement("img");
  img.src = `${window.location.origin}/_event?${payload}`;
};

Browsers request the src of an image as soon as it's set, regardless of whether it's on the page or not. Here, it'll send the request to my server's /_event endpoint, which will log it as JSON.

The server then responds with an image consisting of a single blank pixel. This is where the term "Tracking Pixel" comes from, and a lot of HTTP servers come with built-in features to respond with one. Nginx provides the empty_gif directive.

The event payload is a query string; how it's generated is up to you. I personally used URLSearchParams with a polyfill for older browsers. I originally thought of using JSON.stringify to reduce the number of formats the payload needs to go through; however, the URL becomes unreadable and difficult to inspect.
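For example, building the payload could look something like this; the field names mirror the ones that show up in the logs later in this post:

const payload = new URLSearchParams({
  logger: "client",
  href: window.location.href,
  referrer: document.referrer,
  createdAt: new Date().toISOString(),
}).toString();

track(payload);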

On Nginx's end, I added this location block:

location = /_event {
  access_log /var/log/nginx/event_access.log main;
  error_log /var/log/nginx/event.log info;

  # Log the query parameters as a single JSON line in the error log.
  log_by_lua_block {
    ngx.log(ngx.INFO, require('cjson').encode(ngx.req.get_uri_args()))
  }

  # Disable caching so the CDN always forwards the request.
  add_header Last-Modified $date_gmt;
  add_header Cache-Control 'no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0';
  if_modified_since off;
  expires off;
  etag off;

  # Respond with a 1x1 transparent GIF.
  empty_gif;
}

Pretty simple. Caching is disabled so that the CDN always forwards the request to the server, where it's logged. Note that I'm using cjson, which you'll need to have installed, ideally via LuaRocks.

2. Storing Events

In the location block shown earlier, we use Lua to log the query parameters as JSON. The way Lua is integrated into Nginx means these lines are written to Nginx's error_log rather than the usual access_log, which is reserved for... well, access logs.

One thing to mention is that the access logs also contain our events, so what's the point of the Lua block? The main reason is that it avoids parsing the query parameters externally, which could introduce errors. By doing it all through Nginx, we create a clear cut-off point between HTTP logging and event processing. We could even turn off access logs to reduce server load once everything's up and running.

Unfortunately, Nginx's error logs are wrapped with a lot of junk. For example, here's a truncated log line from our Lua block:

2019/09/19 03:22:04 [info] 29900#29900: *1966485 [lua] log_by_lua(nginx.conf:76):2: {"level":"info","version":"v1.319.0-0-g05d524a-production","href":"https:\/\/jahed.dev\/about","logger":"client","source":"PageLogger","createdAt":"2019-09-19T02:22:03.590Z","referrer":"https:\/\/google.com"} while logging request, client: ...

We can extract and parse the JSON by piping together some common Unix commands:

cat /var/log/nginx/event.log | fgrep log_by_lua | sed --unbuffered -r 's/.*log_by_lua[^{]+(\{.+\}) while.*/\1/' | jq '.'

This will output something like:

{
  "referrer": "https://google.com",
  "level": "info",
  "version": "v1.319.0-0-g05d524a-production",
  "href": "https://jahed.dev/about",
  "logger": "client",
  "source": "PageLogger",
  "createdAt": "2019-09-19T02:22:03.590Z"
}

And there we go. No more personal information, just enough data to generate aggregates.

We can also use log rotation to periodically trigger extraction and delete older logs. I use logrotate myself, but there are plenty of alternatives.
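As a sketch, a logrotate rule for the event log could look something like this; the schedule and retention here are assumptions:

/var/log/nginx/event.log {
  daily
  rotate 30
  missingok
  notifempty
  compress
  delaycompress
  postrotate
    # Tell Nginx to reopen its log files after rotation.
    [ -f /var/run/nginx.pid ] && kill -USR1 $(cat /var/run/nginx.pid)
  endscript
}

The extraction pipeline shown earlier could also be run from postrotate against the rotated file.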

3. Querying Events

Since we now just have files of JSON, we can use any tool that consumes JSON to query our logs. jq provides more than enough functionality for my use cases. It's portable, fast, pipeable, and in general very convenient for terminal-based work. But you can also push the data elsewhere, to something like Logstash or Elasticsearch; really, anything.
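For instance, counting events per page from a file of extracted JSON objects (events.json is an assumed name) is a one-liner:

jq --slurp 'group_by(.href) | map({href: .[0].href, count: length}) | sort_by(-.count)' events.json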

Conclusion

Well, that's pretty much everything. We have a client sending event data to a server which logs it as JSON. From there, we can do whatever we need to with the data. If I need to do anything more complicated such as understanding user journeys, I can easily add the necessary data on the client-side and query it server-side. If and when querying through the terminal becomes too laborious, I can easily import the data to something more suitable and run my queries there.

Thanks for reading.