The proper solution to this is "Server-Side Rendering" (SSR). I've tried SSR before, but it doesn't work here. A lot of the code wasn't written with it in mind, so it just fails. Fixing it would mean going through a lot of code and also making sure any future code works in both cases. It's a big investment.
An alternative I've been thinking of is to go through every page in a web browser and dump the HTML. I've tried it this week using Puppeteer to automate the process and... it works.
There are a few issues though: cache invalidation (as always) and deciding where to run it. It can't render pages on demand server-side since it uses a whole web browser, which is slow and eats up a lot of resources, so it'll need to run ahead of time. Which in turn means it'll need to know which pages to cache.
Right now, the experiment is using the sitemap.xml to scrape FrontierNav. That works for the simpler pages. But pages like the Maps have potentially hundreds of elements, and dumping all of that into HTML would be ridiculous. These pages are the most likely to be shared on social media. I could strip the excess post-render, but then the solution becomes more complex, at which point SSR becomes more appealing again.
Pre-rendering every page ahead of time is also a lot of waste since the vast majority of pages won't be directly navigated to. I could make it more intelligent and have it use access logs to find the pages it should cache, but then it's caching after the fact, so it'll always be one step behind.
I've become more comfortable using Web Workers now. My first implementation of it for the Universal Search had a single global Worker searching across multiple indexes. Now, there's one for each index, created only when Search is activated and discarded after. I really can't tell what difference it's made since there aren't any Dev Tools that show individual performance metrics, but from an implementation point of view, it makes a lot more sense. There's no point having Workers waiting around, hogging resources.
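The "one short-lived Worker per index" pattern can be sketched like this. The worker factory is injected so the same flow works with the browser's `Worker` constructor; the function names are my own, not FrontierNav's.

```javascript
// Spawn a Worker only when Search is activated, use it for one query,
// then terminate it so it never idles around hogging resources.
function searchWithWorker(createWorker, query) {
  const worker = createWorker();
  return new Promise((resolve, reject) => {
    worker.onmessage = (e) => resolve(e.data);
    worker.onerror = reject;
    worker.postMessage(query);
  }).finally(() => {
    worker.terminate(); // discarded straight after the search completes
  });
}
```

In the browser, `createWorker` would be something like `() => new Worker('search-index.js')`, one per index.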
Agility, the mobbing timer I made a while back, had a bug report. Since it's just a single web page, I recently stripped it down from a Middleman statically generated site to a simple static site with no build step. It's a lot easier to maintain now. I'm still a bit reluctant to keep maintaining it though, since it's not something I use anymore and I'm not getting anything out of it. But it's not a big deal.
I can't remember how I started going down this route, but I do know that as someone with multiple websites, I should be doing everything I can to ensure nothing malicious is being loaded onto my visitors' computers.
Actually, I do remember. I was looking into how FrontierNav can introduce an iframe-based, postMessage API to allow third-party integrations -- an exciting topic for another time. Loading iframes from other places is of course open to abuse, so I looked into securing it.
My sound card died this week. I didn't mind. A year ago I assumed the interference coming from my speakers was from my motherboard's on-board sound, so I bought a PCI one, avoiding an external one so I wouldn't have to deal with more cabling.
When I switched from Windows to Linux as my main operating system, the main issue I had was with drivers: Sound (Asus), Wi-Fi (Broadcom) and Video (Nvidia). They're all proprietary, and Linux support is abysmal thanks to various free-versus-proprietary conflicts of interest.
I use CentOS on my servers, so I thought I'd try Fedora. But since Fedora has a Free Software policy, most of the drivers weren't officially supported. The sound didn't work at all. So I chose Linux Mint which came with everything out of the box.
After a few months, I noticed my speakers were still picking up random radio stations every now and then, so I bought new ones. Problem solved. I didn't notice a difference in sound quality with the sound card either. So it was all a waste of time.
It seems Fedora 30 has since greatly improved its driver support. After the sound card died, and with the issues I've been having since last week, I gave Fedora another chance. This time, everything worked -- except the WiFi, so I switched over to my old PowerLAN, which now works because I relocated my router to a different room. Though, I'm not sure if the PowerLAN or Fedora is the cause of the random lag I've been getting on my network. What a mess.
On the bright side, I actually like using GNOME now. I've always been an Xfce fanboy, but I've had to deal with a ton of caveats over the last year. Firefox flickers randomly and gives me a headache. The panels go out of sync with my multi-monitor setup. Music stops playing when I lock the screen. Screen tearing everywhere with Nouveau drivers. Broken resolutions with Nvidia drivers. The list is endless.
I wonder what new issues I'll run into over the next year with Fedora...
Weekly Report is going to be a new series of blog posts giving an update on what I did during the week. The aim is to share what I've done and also to help me appreciate and compare my achievements.
While this series will contain FrontierNav-related updates, it'll also cover my other, unrelated projects. If you just want FrontierNav updates, you can wait for the monthly FrontierNav Progress Reports.
Over the years, I've become less and less trusting of third-party network requests on the websites I visit. In part, it's due to the ever-escalating hoarding and selling of our personal data by ad-tech companies; something I've witnessed first-hand working in the industry.
However, there are legitimate use cases for tracking on the public web to better understand your users and improve your product. In fact, I've come to the conclusion that it really is the only way to get accurate feedback. The vast majority of users will never tell you how they use your website, and the ones that do will likely skip over certain details.
So, to me, the problem with tracking isn't the tracking itself but how the data is managed. And the easiest way for me to make sure data isn't being misused is to host it myself.
It's worth mentioning why I need event tracking and what my requirements are:
I don't want to identify individual users. Aggregated data is good enough.
I want to know where users are coming from and where they're landing. This allows me to debug any issues such as broken links and also understand which online communities my users are from.
I want to know when certain calls to action (such as buttons) are triggered, to see how effective they are.
Absolute control. I need to be able to control exactly what gets stored and where so that I can take informed responsibility for the data.
Low maintenance. I don't want to maintain features I don't need, such as databases and web portals that need to be running at all times.
Low cost. Since I'm not making any money out of it, I don't want to pay for features I don't need.
Why not use Request Logs?
One of the simplest ways to get event data is to look at request logs. Doing just that would fulfil my requirements.
However, in my case, I've put my web server behind Cloudflare's CDN. Meaning, Cloudflare gets most of the requests, and only contacts my web server when it needs to refresh its caches.
Removing the CDN is not an option as it reduces a lot of my bandwidth costs and server load. And, as far as I know, Cloudflare's free tier does not provide network logs.
The only solution is to have a separate request sent directly to my server with similar details. This can be done either by using a separate domain or, to avoid cross-origin request issues, by disabling CDN caching using Cache-Control headers.
While the latter does mean the CDN is handling every request and likely logging it, that's already the case with most of the website's content. Removing the CDN also introduces other issues such as exposing the web server to direct malicious attacks.
There are plenty of self-hosted event tracking services that provide similar features to third-party solutions like Google Analytics. Matomo (formerly Piwik) is probably the most popular of the bunch.
At the end of the day, all these web analytics services can be broken down into three steps:
Send. A client sends events to a server.
Store. The server processes and stores events.
Query. The server provides an interface to query events.
Pretty much every solution differentiates itself on its querying capabilities. So much so that Matomo, while mostly open source, places its more advanced features behind a paywall.
While these services satisfy my basic requirements, they also do a lot more, and as such, I lose a lot of control and have to maintain more than I'm actually using.
I already have an Nginx server compiled with a Lua module (via OpenResty), so ideally, a simple handler to log my events to disk will be enough. To simplify event processing, I can log my events as JSON, then query and aggregate those logs using jq. The server itself is not very powerful, so anything heavier, like Node.js, isn't possible.
1. Sending Events
Tracking has been a core part of the web for a while. So much so that web browsers have built-in mechanisms to send tracking events.
Anchor tags (<a>) have the ping attribute, which sends a POST request to a list of URLs. However, it only works on anchor tags, so it won't cover buttons and other interactive elements.
The Beacon API also sends a POST request with a custom payload. It's specifically made for event tracking so browsers can optimise for it. In a perfect world, I'd use it, but the API doesn't work on certain browsers like Safari 10 so it isn't universal.
There's also the Fetch API which lets you send any request, and with the keepalive flag enabled, it's similar to the Beacon API but with more flexibility.
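Since all of these mechanisms boil down to sending a request, they can be feature-detected with a small helper. This is a sketch: the function name and fallback order are my own, not a library API.

```javascript
// Pick the best available transport for sending a tracking event:
// the Beacon API where supported, then fetch with keepalive, then
// the classic image-pixel fallback.
function chooseTransport() {
  if (typeof navigator !== 'undefined' && typeof navigator.sendBeacon === 'function') {
    return 'beacon'; // navigator.sendBeacon(url, payload)
  }
  if (typeof fetch === 'function') {
    return 'fetch'; // fetch(url, { method: 'POST', keepalive: true, body: payload })
  }
  return 'image'; // new Image().src = url
}
```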
One of the most popular ways is to use the <img> tag programmatically to send a GET request. I'll be using this approach as it's common and lightweight. However, it's possible to use any of the methods above as at the end of the day, they all do the same thing: send a request. Chances are I'll switch to the Beacon API at some point.
Browsers request the src of an image as soon as it's set, regardless of whether the image is on the page or not. Here, it'll send the request to my server's /_event endpoint, which will log it as JSON.
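A minimal sketch of the client side, assuming the /_event endpoint from this post; the helper names are my own.

```javascript
// Build the event payload as a query string. URLSearchParams keeps the
// URL readable when inspecting requests in dev tools.
function buildEventUrl(endpoint, fields) {
  const params = new URLSearchParams(fields);
  return `${endpoint}?${params.toString()}`;
}

// Fire-and-forget: setting src on an Image triggers the GET immediately.
function trackEvent(fields) {
  const url = buildEventUrl('/_event', fields);
  if (typeof Image !== 'undefined') {
    new Image().src = url; // only fires in a browser environment
  }
  return url;
}
```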
The server then responds with an image consisting of a single blank pixel. This is where the term "Tracking Pixel" comes from, and a lot of HTTP servers come with built-in features to respond with this blank pixel. Nginx uses empty_gif.
The event payload is a query string; how it's generated is up to you. I personally used URLSearchParams with a polyfill for older browsers. I originally considered JSON.stringify to reduce the number of formats the payload goes through, but the URL becomes unreadable and difficult to inspect.
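The endpoint itself can be a small OpenResty location block along these lines. This is a sketch assuming the ngx_http_lua_module and lua-cjson; the log level and exact directives are my assumptions.

```nginx
location /_event {
    # Stop the CDN from caching, so every hit reaches this server.
    add_header Cache-Control "no-store" always;

    # Log the query parameters as one JSON object per line.
    log_by_lua_block {
        local cjson = require("cjson")
        ngx.log(ngx.NOTICE, cjson.encode(ngx.req.get_uri_args()))
    }

    # Respond with a 1x1 transparent GIF.
    empty_gif;
}
```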
Pretty simple. Caching is disabled so that the CDN always forwards the request to the server where it's logged. Note that I'm using cjson which you'll need installed, ideally using LuaRocks.
2. Storing Events
In the location block shown earlier, we use Lua to log the query parameters as JSON. The way Lua is integrated into Nginx means these lines go into Nginx's error_log, rather than the usual access_log, which is reserved for... well, access logs.
One thing to mention is that the access logs also contain our events. So what's the point of the Lua block? The main reason is that it avoids parsing the query parameters externally, with all the potential errors that brings. By doing it all through Nginx, we create a clear cut-off point between HTTP logging and event processing. We could even turn off access logs to reduce server load once everything's up and running.
Unfortunately, Nginx's error logs are wrapped with a lot of junk. For example, here's a truncated log line from our Lua block:
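Something like this (a reconstruction; the exact wrapper varies by nginx version and log level). A couple of sed expressions are enough to strip the wrapper and leave just the JSON:

```shell
# A hypothetical error_log line produced by the Lua block:
line='2019/06/09 12:00:00 [notice] 1234#0: *56 [lua] _event: {"event":"click","referrer":"https://example.com"} while logging request, client: 203.0.113.9'

# Strip everything before the first '{' and after the last '}'.
printf '%s\n' "$line" | sed -e 's/^[^{]*//' -e 's/[^}]*$//'
```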
And there we go. No more personal information, just enough data to generate aggregates.
We can also use log rotation to automatically trigger extraction periodically and to delete older logs. I use logrotate myself but there are many others.
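A logrotate config for this might look like the following sketch. The paths, schedule and extraction pipeline are all assumptions, not my actual setup.

```
/var/log/nginx/error.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    postrotate
        # Extract event JSON from the rotated log into the event store.
        grep -F '{' /var/log/nginx/error.log.1 \
            | sed -e 's/^[^{]*//' -e 's/[^}]*$//' \
            >> /var/log/events/events.log
        # Tell nginx to reopen its log files.
        [ -f /var/run/nginx.pid ] && kill -USR1 "$(cat /var/run/nginx.pid)"
    endscript
}
```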
3. Querying Events
Since we now just have files of JSON, we can use any tool that consumes JSON to query our logs. jq provides more than enough functionality for my use cases. It's portable, fast, pipe-able and in general very convenient for most terminal-based work. But you can also push the data elsewhere like Logstash, Elasticsearch, really anything.
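For example, counting events by name is a one-liner (assuming jq is installed and events.log holds one JSON object per line, as produced earlier):

```shell
# Sample data standing in for the extracted event log.
printf '%s\n' \
  '{"event":"click","referrer":"https://example.com"}' \
  '{"event":"click","referrer":"https://example.org"}' \
  '{"event":"search","referrer":""}' > events.log

# Aggregate: how many times each event fired, most frequent first.
jq -r '.event' events.log | sort | uniq -c | sort -rn
```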
Well, that's pretty much everything. We have a client sending event data to a server which logs it as JSON. From there, we can do whatever we need to with the data. If I need to do anything more complicated such as understanding user journeys, I can easily add the necessary data on the client-side and query it server-side. If and when querying through the terminal becomes too laborious, I can easily import the data to something more suitable and run my queries there.
HTTPS is slowly becoming more and more common throughout the web, as it should. It provides a level of security and privacy that the web somehow neglected for decades. Decades' worth of hyperlinks and caches have created a backlog of insecure HTTP destinations, regardless of whether the site itself now supports HTTPS.
Redirecting old HTTP traffic to HTTPS, while a good start to securing your website, still requires the initial insecure HTTP request to be made, which can be intercepted at any point in time.
This is where HTTP Strict Transport Security (HSTS) comes in. HSTS is a way to ensure your website is always loaded via HTTPS, without the need for constant insecure redirects.
I'll go through how to introduce HSTS to your website in a CDN configuration. Such a configuration has two points of insecure communication:
The CDN to the Origin (e.g. Cloudflare to Your Web Server).
The Client to the CDN (e.g. Web Browser to Cloudflare)
Before continuing, you need to make sure your existing server supports HTTPS for all of its content. Insecure HTTP requests will no longer work. Also, make sure you've streamlined your SSL certificate renewal process so that you're not locked out of your website when a certificate expires.
Also, please read the entire post before doing anything as some steps are difficult to reverse if you come across any issues.
Securing CDN to Origin
The Origin can be any number of things. An AWS S3 bucket, a Virtual Private Server (VPS), etc. HSTS is just another HTTP header like any other, and regardless of which service you're using, the same principles apply.
To make sure your website doesn't break, it's worth doing this gradually. The HSTS header has three parameters: max-age, includeSubDomains and preload. Start with just a short max-age:
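As a sketch, on nginx the initial header might look like this (other servers and CDNs have equivalent settings):

```nginx
add_header Strict-Transport-Security "max-age=300" always;
```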
This will make all requests cache your HSTS preference for 5 minutes (300 seconds). So if you publish this and then remove it, any browsers that visited your website between those two points in time will carry on using HTTPS for up to 5 minutes before asking again.
After you've added this header, you'll need to find some pages on your website that aren't already cached. Since we're securing the CDN to Origin network, we need to make sure the CDN can communicate with the Origin to refresh its caches.
Once you've invalidated some caches, click around and make sure everything still works, including any subdomains you own. You might want to wait a week and see if any new errors show up. How careful you are is up to you.
When you're happy with the results, you can bump up the max-age to whatever you want. I'd suggest a year, as we'll need it later on for the preload flag.
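On nginx, the bumped-up header would look something like this sketch (only add includeSubDomains once you're sure every subdomain is HTTPS-only; the preload flag comes later):

```nginx
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
```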
Now that the Origin is secure, we can apply the same steps to the CDN. Add the same header with a small max-age, make sure the website still works using the same processes and when you're happy, bump up the max-age to a year.
Now that we've got HSTS up, we can consider switching on preload. Preloading essentially lets browsers know your website is strictly HTTPS without the user needing to make that initial request to ask, which is itself an attack vector.
To add your site to this list of known websites, you can submit it using the HSTS Preload List Submission website. Make sure to follow their requirements. They also have general advice around HSTS.
At the end of these steps, your website will be operating fully under HTTPS with no insecure channels.