Why Did Facebook Fail?

While the Facebook bungle has been widely described in the media as a “DNS problem”, it seemed to me from the outset that it was more likely to be due to BGP, the Border Gateway Protocol, one of the least known and potentially vulnerable parts of the Internet's infrastructure. BGP, first defined in 1989 and in use on the Internet since 1994, dates from the era when the Internet was composed of a relatively small number of technically proficient and trustworthy institutions, and assumes those running it share those attributes.

BGP is how the Internet routes packets among the multitude of independent networks that participate in its common network. By default, the routers that run BGP accept the routing information advertised by their peers. This means that a malicious router can pollute the routing tables of other routers, a form of attack known as BGP hijacking. There have been a number of notable incidents, including that time in 2008 when Pakistan tried to block YouTube and ended up taking down YouTube worldwide.

But BGP's insecurity makes it just as prone to calamity from an unintentional fat-finger as from deliberate malice, and that appears to be what happened to Facebook. It was perceived as a “DNS problem” only because Facebook's DNS servers had disappeared from the Internet, but so had everything else in Facebook's IP address ranges. That's the signature of a BGP face-plant.
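To make the trust problem concrete, here is a minimal Python sketch of a toy routing table that, like a BGP router in its default configuration, accepts whatever its peers advertise. It is an illustration under stated assumptions, not an implementation of BGP: the ToyRouter class, the peer names, and the 203.0.113.0/24 documentation prefix are all hypothetical, and real BGP involves sessions, path attributes, and policy that this ignores. It shows only two things: a more specific prefix announced by anyone wins the lookup (the usual mechanics of a hijack), and once the routes are withdrawn the addresses simply become unreachable, DNS servers included.

```python
import ipaddress


class ToyRouter:
    """Toy routing table that trusts every advertisement (hypothetical, not real BGP)."""

    def __init__(self):
        # Maps an advertised prefix to the name of the peer that announced it.
        self.routes = {}

    def advertise(self, prefix, origin):
        # No validation at all: whatever a peer announces is accepted.
        self.routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        # Remove the route if present; ignore unknown prefixes.
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        # Longest-prefix match: the most specific covering prefix wins.
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: the address has vanished from this router's view
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


router = ToyRouter()

# The rightful owner announces its address range.
router.advertise("203.0.113.0/24", "legitimate-owner")
before_hijack = router.lookup("203.0.113.10")

# Anyone can announce a more specific prefix, and it is believed.
router.advertise("203.0.113.0/25", "hijacker")
after_hijack = router.lookup("203.0.113.10")

# A fat-fingered withdrawal of every route removes the range entirely.
router.withdraw("203.0.113.0/25")
router.withdraw("203.0.113.0/24")
after_withdrawal = router.lookup("203.0.113.10")

print(before_hijack, after_hijack, after_withdrawal)
```

In this sketch the same address resolves first to the legitimate owner, then to the hijacker, and finally to nothing at all. The last case is the one that resembles the Facebook outage as described above: nothing was answering wrongly, there was just no path left to anything in the range.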

BGP is neither simple nor straightforward, which is one reason it is little known, poorly understood, and easy to mess up. Here is an hour-and-a-half deep dive into BGP, which may leave you even more confused than you are now.


If only it would fail permanently. It is the virtual Love Canal of the internet with a management layer of malevolence to boot.