Over the last couple of weeks I’ve run some analysis on the one million most popular sites on the web.
External resources, risky business
- The hosting site is compromised (by a hacker, or a former or current employee) and the scripts are replaced by something malicious.
- The company hosting the script goes out of business and a less well-intentioned entity takes control of the domain.
- The code is changed on the fly and it breaks your site.
The damage such a script can do to your users, your website and your reputation is extensive. User data, cookies and sessions can be captured or hijacked, giving the controllers of the script access to user identities and control over their accounts, as well as the ability to attempt to install malware on your visitors’ machines (for instance, to steal access credentials for online banking).
Here are the most interesting results from the analysis:
jQuery, by far the most commonly included library, was present on 55% of the sites surveyed, and this probably makes the jQuery web servers some of the hottest targets to hack. 50% of the domains contained advertising of some form. Google content was embedded on 58% of the pages and Facebook content on 26%. This gives those companies an excellent angle on expanding their profiles of the visitors of those websites. After all, even if these are not Google properties, there is absolutely nothing to stop Google or Facebook from adding an entry to their user profiles to record the visit.
Flash seems to be very rapidly on the way out: less than 1% of the homepages of the domains I looked at still contained Flash content. Keep in mind, though, that these are the larger websites, so presumably they have more budget to keep up with the times. On less well-maintained sites the percentage with Flash is likely higher. Also, these are only the homepages: there is no guarantee that the rest of a site won’t contain a large amount of Flash, and the server might have sent a different page if the browser had indicated that it supports Flash (which PhantomJS, the headless browser all this work was done with, does not).
The practical upshot of this is that on roughly two-thirds of the web you are not just talking to the URL that you see in the browser but also to any one of a number of other parties who have - in principle - absolutely no business knowing about your visit and who, if their servers are compromised, have the ability to distribute malware or other nasty stuff to vast numbers of people.
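To make "other parties" concrete, here is a minimal Python sketch of how the resource URLs a page requests could be reduced to a list of third-party hosts. It is not the code used for this analysis (that was built on PhantomJS); the function names `site_key` and `third_party_hosts` are made up for illustration, and the host comparison is deliberately naive: it treats subdomains of the page's host as first-party and does not consult the public suffix list.

```python
from urllib.parse import urlparse


def site_key(url):
    """Return the lowercased hostname of a URL with a leading 'www.' removed."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def third_party_hosts(page_url, resource_urls):
    """Return the sorted hosts of resources not served by the page's own host.

    Assumption: subdomains of the page's host count as first-party.
    """
    own = site_key(page_url)
    external = set()
    for url in resource_urls:
        host = site_key(url)
        # Relative URLs and data: URIs have no hostname and stay first-party.
        if not host or host == own or host.endswith("." + own):
            continue
        external.add(host)
    return sorted(external)


if __name__ == "__main__":
    # Hypothetical request list for a page on example.com.
    requests_seen = [
        "http://example.com/style.css",
        "https://static.example.com/app.js",
        "https://code.jquery.com/jquery.js",
        "https://www.google-analytics.com/ga.js",
        "/images/logo.png",
    ]
    print(third_party_hosts("http://www.example.com/", requests_seen))
```

Every host in the resulting list is a party that learns about the visit and could serve altered content.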
Avoid including resources from smaller companies, copy them to your own server if the license allows it, or find an alternative on a host with good reputation.
By far the safest approach for website owners who care about their users and their users’ privacy is to simply not include anything at all from other people’s servers. The downside of this is that your users will download a few extra copies of some of the more common libraries, but that’s a very small price to pay for a significant reduction in risk. Google Analytics junkies in particular will have to weigh whether their users’ privacy is more important to them than their ability to analyze their users’ movements on the site. The number of adult sites, anonymous tip sites (for things such as child abuse, crime reporting and bullying reporting) and other sites in a similar vein that contain Google Analytics tags is especially food for thought.
In the slightly longer term there will be something called ‘Subresource Integrity’ hashes, which should take care of at least the risk of code being changed out from under the website embedding it (thanks to HN users mangeletti and cbr).
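As a rough illustration of how such an integrity hash works: the embedding site publishes a base64-encoded digest of the exact file it expects, and a supporting browser refuses to run a script whose content does not match. The Python sketch below computes a value in that format; the helper names `sri_hash` and `script_tag` and the example URL are made up for illustration.

```python
import base64
import hashlib


def sri_hash(content: bytes, algorithm: str = "sha384") -> str:
    """Return an integrity value of the form '<algorithm>-<base64 digest>'."""
    digest = hashlib.new(algorithm, content).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


def script_tag(url: str, content: bytes) -> str:
    """Build a script tag that pins the remote file to the given content."""
    return (
        f'<script src="{url}" integrity="{sri_hash(content)}" '
        'crossorigin="anonymous"></script>'
    )


if __name__ == "__main__":
    # Hypothetical library body; in practice, read the exact bytes of the file you reviewed.
    body = b"console.log('hello');"
    print(script_tag("https://cdn.example.com/lib.js", body))
```

Note that this only protects against the file changing; the third party still sees the request, so the privacy concerns above remain.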
Some things popped up while doing the work that struck me as worth reporting, even though they are not about large numbers of websites.
- Not-so-anonymous: WeTip.com
wetip.com allows you to anonymously report crimes. Anonymous for small values of anonymous: the site does not even use encryption, and the page with the form contains an AddThis widget and a font from Google. So I guess that as long as you don't mind those parties, and everybody who sits on the pipes between you and wetip.com, knowing that you reported a crime, you should be fine. And if you end up not being fine then you have another crime to report (maybe use the phone in that case?). This is not the only anonymous tip site that made these (and other) mistakes, but it serves as a good example.
- We-like-requests: Dailymail.co.uk
The homepage of the Daily Mail (that paragon of journalistic virtue) makes a whopping 720 requests and takes over a minute to load completely on my broadband connection.
- Some major brand websites are using evercookies.
I'm not quite ready to name them, because it may very well be that the included bits and pieces come from an advertising company with questionable ethics, but this is really very bad. More to follow.
Many thanks to the creators of PhantomJS, who made the programming behind this project much simpler than it would otherwise have been.
If you want to replicate the results or if you want to modify the goals and/or the code then you can find all of the code + instructions on how to run it at github.com/jacquesmattheij/remoteresources