FrontierNav Static Generation

FrontierNav is a JavaScript-heavy application. So much so that loading it up without JavaScript will just show a warning. This is fine, since without JavaScript, FrontierNav becomes a bare-bones wiki, and there are plenty of better wikis around.

However, it does cause issues with search engines, social network sharing and other services that expect relevant information in the HTML on request. Google's search engine does run JavaScript, but others like Bing don't. Share a link to FrontierNav on Twitter and you'll just see a generic embed. This is a problem, as it means FrontierNav isn't interacting properly with the rest of the web.

The proper solution to this is "Server-Side Rendering" (SSR). I've tried SSR before, but it doesn't work. A lot of the code wasn't written with it in mind, so it just fails. Fixing that would mean going through a lot of code and making sure any future code works in both cases. It's a big investment.

An alternative I've been thinking of is to go through every page in a web browser and dump the rendered HTML. I've tried it this week, using Puppeteer to automate the process, and... it works.
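
The core of the idea is small enough to sketch here. This isn't FrontierNav's actual script and the URL, output path and wait condition are placeholders, but it shows the shape: load a page in headless Chrome, wait for the client-side render to settle, then save the resulting HTML.

```js
// Rough sketch, not the real script. Loads a page in headless Chrome,
// waits for network activity to settle (so the client-side render is done),
// then dumps the rendered HTML to a file.
const puppeteer = require('puppeteer')
const fs = require('fs')

async function snapshot(url, outputPath) {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto(url, { waitUntil: 'networkidle0' })
  const html = await page.content()
  fs.writeFileSync(outputPath, html)
  await browser.close()
}

// Placeholder URL and filename for illustration.
snapshot('https://frontiernav.net/', 'index.html')
```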

There are a few issues though: cache invalidation (as always) and deciding where to run it. It can't run server-side on each request, since it uses a whole web browser, which is slow and eats up a lot of resources. So it'll need to run in parallel, ahead of requests, which in turn means it'll need to know which pages to cache.

Right now, the experiment uses the sitemap.xml to scrape FrontierNav. That works for the simpler pages. But pages like the Maps have potentially hundreds of elements, and dumping all of that into HTML would be ridiculous. These pages are also the most likely to be shared on social media. I could strip the excess post-render, but then the solution becomes more complex, at which point SSR becomes more appealing again.
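
For reference, the sitemap-driven crawl looks roughly like this. It's a sketch: the sitemap URL, the regex-based parsing and the one-page-at-a-time loop are assumptions, and it reuses the snapshot function from the earlier sketch.

```js
// Sketch: pull every <loc> URL out of the sitemap and snapshot each one.
// The sitemap URL is an assumption; snapshot() comes from the sketch above.
const https = require('https')

function fetchText(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let body = ''
      res.on('data', (chunk) => { body += chunk })
      res.on('end', () => resolve(body))
    }).on('error', reject)
  })
}

async function snapshotAll() {
  const xml = await fetchText('https://frontiernav.net/sitemap.xml')
  const urls = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map(m => m[1])
  for (const url of urls) {
    // One page at a time is good enough for an experiment.
    await snapshot(url, encodeURIComponent(url) + '.html')
  }
}
```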

Running in parallel is also a lot of waste, since the vast majority of pages will never be directly navigated to. I could make it more intelligent and have it use access logs to find the pages it should cache, but then it's caching after the fact, so it'll always be one step behind.
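
If I went the access-log route, the filtering part might look something like this. It's purely illustrative: the log location, the combined-format assumption and the hit threshold are all made up.

```js
// Hypothetical sketch of the access-log idea: count GET requests per path in
// a combined-format log and keep only paths that were hit often enough.
// The log path and threshold are made up for illustration.
const fs = require('fs')

function popularPaths(logFile, minHits) {
  const hits = new Map()
  for (const line of fs.readFileSync(logFile, 'utf8').split('\n')) {
    const match = line.match(/"GET ([^ ]+) HTTP/)
    if (match) {
      hits.set(match[1], (hits.get(match[1]) || 0) + 1)
    }
  }
  return [...hits]
    .filter(([, count]) => count >= minHits)
    .map(([path]) => path)
}

// e.g. popularPaths('/var/log/nginx/access.log', 10)
```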

Decisions. Decisions.