Navigating Single Points of Failure: Lessons from Personal and Professional Life

Rishabh Kaushik

Jul 1, 2023 • 10 min read

In the previous article “Beyond the Black Box: The Risks and Rewards of Abstraction”, I talked about abstractions and how they affect us. Here, I’d like to talk about bottlenecks. I start with a story from my personal life, then extend the concept to the professional and tech world.

The incident is a few years old now. I had recently moved to Chennai (India), a new city, for my job. Being new there, I didn’t know many places, I didn’t speak the local language, I wasn’t used to the local food, I didn’t have my own vehicle yet, and I was living alone. Even so, this being 2020, life was manageable because I had a smartphone and there is a service for everything. I could order food I liked through apps like Swiggy or Zomato, book a cab with Uber or Ola, and order groceries from Swiggy or Big Basket. I didn’t need to talk to locals much, and when I did, I could try English or Hindi, or fall back on Google Translate in extreme cases.

And then one day, my phone broke and all hell broke loose. Let’s look at the repercussions.

Starting with the essential: food. I relied on Swiggy for my daily meals, but Zomato and Swiggy are mobile-first platforms. Both did have a web portal, but sign-in needed a mobile number and a text message, and no other sign-in method was supported. Okay, maybe I could contact one of my friends and ask them to order food for me. But how would I contact them? I couldn’t call, text or send a WhatsApp message. Email? Social media? Most of those needed 2FA, and the text or authentication PIN would go to my mobile. After some searching, I found that my laptop was already logged in to Instagram, and I could reach one of my friends there. He gave me his Swiggy credentials so that I could order food. After searching and adding items to the cart came the payment. UPI apps were out of the question, I didn’t have cash, and credit cards needed an OTP. Finally, I asked the same friend to pay for it, and food was temporarily taken care of.

Now it was time to repair the phone or find a new one. If I bought one online, it would be delivered after a day or two, which was too long to wait. So I looked for a local repair shop, preferably one where they could understand English or Hindi. Finding one wasn’t difficult, but getting there would be: no mobile to book a cab, and no shops close by. I finally picked a shop a couple of kilometres away and walked to it. I handed over the phone, explained the problem, passed the time for two hours and got my mobile back.

Reflecting on the incident, I realized how much of our daily life is funneled through a single point. UPI needs a mobile device, credit cards need an OTP for payments, many logins such as email and social networks require 2FA in the form of a text message or an authenticator app, and many apps are mobile-only. If that central device fails, you’re cut off from many day-to-day activities that we take for granted.

Extending the concept to tech, let’s discuss some of the case studies I came across.

Network Infrastructure

In October 2021, there was an outage across many Facebook services. Website outages are not uncommon, but what made this one stand out was how far the failure cascaded. According to Facebook’s own write-up, a command issued during routine maintenance took down the backbone connections between its data centres, and its DNS servers then stopped advertising themselves to the rest of the internet. With DNS unreachable, many internal tools stopped working. Engineers also could not log in to the data centres remotely, because the remote login system depends on DNS to resolve names to IP addresses. A team of engineers was sent to the data centre in person. To add more obstacles, the door access locks were reportedly RFID-based and depended on backend servers to grant access to engineers; backend servers which were now unreachable. By some reports, they finally had to force the door physically in order to reach the servers and fix the issue. It took some time, but it got resolved. You can read the complete technical write-up here: https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/

Cloudflare is a company with multiple offerings. One of its most popular is its CDN, a service that a large share of websites rely on. Due to a bug, the CDN service went down, and many sites across the internet couldn’t load correctly.

There have been similar incidents across the internet. Giants like GitHub, GitLab, Google and Microsoft have all fallen victim to them. For those in the field, there is an extensive collection of post-mortems here: https://github.com/danluu/post-mortems

Open-Source Code

Yes, open source is awesome. But open source also builds on code contributed by multiple authors; in fact, that’s why the model works. You don’t build the complete house yourself; you use bricks and interiors designed and manufactured by others. Now what if someone took away a few of the bricks that make up an essential layer? The house comes crumbling down.

A similar incident happened in 2016, when 11 lines of code “took down the internet”. left-pad was a simple, obscure package on npm which did nothing more than pad the start of a string with spaces or other characters until it reached a given length. The function was trivial, and yet in the web of dependencies, some heavily used libraries relied on it. The developer who wrote left-pad had also published a package called kik. The problem: it shared its name with Kik, a messaging app based in Canada, which led to a dispute over the name, and npm eventually handed the name over to the company. In response, he took down all the open-source packages he had published on npm over the years, one of which was left-pad, and thousands of projects started failing because the dependency couldn’t be found. The matter was resolved when npm decided to un-unpublish the package. If you’d like to read more, this article captures further details: https://qz.com/646467/how-one-programmer-broke-the-internet-by-deleting-a-tiny-piece-of-code
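To give a sense of how small the missing piece was, here is a minimal Python sketch of what a left-padding function does. It is an illustration of the idea only, not the original JavaScript source of the npm package; the function name and parameters are my own.

```python
def left_pad(text, length, fill=" "):
    """Pad the start of `text` with `fill` until it is `length` characters long."""
    text = str(text)
    if not fill:
        fill = " "  # fall back to a space if an empty fill is given
    missing = length - len(text)
    if missing <= 0:
        return text  # already long enough, nothing to do
    return fill[0] * missing + text


print(left_pad("5", 3, "0"))
print(left_pad("abc", 2))
```

Python already ships the same behaviour as `str.rjust`, which is a reminder that trivial helpers are often cheaper to own than to depend on.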

Many developers have also faced issues with the versioning of APIs or libraries that are under active development. New features and security fixes are the upside of constantly updated systems, but the downside is that APIs or libraries may not stay backward compatible, so developers must constantly maintain and test their software for each platform, each update and each change in a dependency or API.

Electricity

Can you imagine your life without electricity? Today EVERYTHING depends on it. It’s not just lights to see in the dark, fans, heaters and ACs to keep you comfortable, or communication and entertainment devices. We are also moving towards commuting in electric vehicles and trying to convert almost everything to electricity. Now, similar to the internet, the electricity network is complex. There are production units, there are consumers, and there are multiple levels of distribution, storage and conversion.

Imagine a catastrophe which takes down the electricity for a locality, a city or a state. It could be caused by solar flares, by production units running short of raw materials, by strikes or by overloading of the grid. What if any of those occur? Well, you don’t need to imagine; you just need to hear, read or watch stories of the February 2021 winter blackout in Texas, USA. In the middle of a severe winter storm, demand for heating surged while power plants and fuel supplies froze, causing failures to cascade across the grid and leaving millions of homes without electricity for days.

Supply Chain

When you are in manufacturing, you work with multiple vendors to source the parts needed to build your product. Your manufacturing also depends on various machines and operators. If a critical part or machine is unavailable, your manufacturing line stops, costing money and reputation and affecting everything further down the line where your product will be used by customers.

Preventing the issues

The first question is: can we even prevent these? My take is no, but we can work to minimize them. We can de-risk our lives and our products so that these issues become less devastating. There are trade-offs to be considered, but with careful thought we can build more reliable solutions, be it in personal or professional life.

Looking back, the mobile being a single point of failure was still foreseeable, but what about variables outside our control? What if the government decides to shut off the internet for an entire region, which does happen from time to time to keep riots from spreading? How would you carry out the many activities that rely so heavily on the internet? What if there is a power failure across a state due to riots, war or technical failures? When a system grows, or is used by a large or critical number of users, each dependency and path needs to be analysed. We need to identify critical choke points and have backups and replacements for them.

Be it hardware, software or supply chain, multiple services and vendors are essential for high-performing systems, and we couldn’t get rid of them even if we tried. But for every system, a lower-performing fallback can be planned and executed. The first thing necessary for this is a dependency graph to the nth level. Your deliverables depend on a, b and c; a depends on x and y; y depends on z; and z might depend back on x and a. Next comes a what-if study: ask what happens if a dependency fails, be it a vendor unable to fulfil your order, a machine being powered down, software taken off the internet, or the internet and electricity being down, and then carefully plan backups that suit your needs. Let’s look at specific examples.
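The dependency graph and what-if study can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions: the graph is the a, b, c, x, y, z example from the paragraph above, and the function names are hypothetical, not taken from any particular tool.

```python
# Each key depends on the items in its list (the example from the text).
DEPENDS_ON = {
    "deliverable": ["a", "b", "c"],
    "a": ["x", "y"],
    "y": ["z"],
    "z": ["x", "a"],  # z depends back on x and a, which forms a cycle
}


def transitive_dependencies(graph, node):
    """Return everything `node` depends on, directly or indirectly."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; this also keeps cycles from looping forever
        seen.add(current)
        stack.extend(graph.get(current, []))
    return seen


def impacted_by(graph, failed):
    """What-if study: which nodes break if `failed` becomes unavailable?"""
    return {n for n in graph if failed in transitive_dependencies(graph, n)}


def is_in_cycle(graph, node):
    """A node is part of a cycle if it (indirectly) depends on itself."""
    return node in transitive_dependencies(graph, node)


for candidate in ["x", "b", "z"]:
    print(candidate, "->", sorted(impacted_by(DEPENDS_ON, candidate)))
```

Running the what-if query for every node gives a rough ranking of choke points: the more nodes a single failure takes down, the more that dependency deserves a backup.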

Software

Keep local backups of the open-source code you use on your own server, and rely less on hosting sites like PyPI/npm/GitHub to retrieve it for you. This ensures that if the hosting site or the developer takes it down, you still have something to fall back on, at least temporarily.
Fix the version of each dependency and API you’re working with. Most services that host packages, like npm/PyPI/GitHub, allow you to retrieve a specific version of a dependency. Instead of choosing the latest version each time, pick a version and freeze it until the project is complete. Then periodically update it on one system first and test it before rolling it out to production.
Freeze the system. If the solution is designed for one specific task, there is no need to upgrade it automatically. The operating system, hardware and software versions can all be frozen in time, and the system can be operated independently for years or decades. This does have downsides, which I’ll discuss in another article.
Use feature flags. You should be able to turn a feature on or off without hassle. Writing software with that in mind is important.
Carry it forward. Just as you used dependencies to build your solution, your customers will use your solution to build their systems. Remember to grant them the same flexibility you’d expect from your vendors. Make your software backward compatible, version it properly, allow users to roll back a version, and provide proper documentation. Build it to be sustained.
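Two of the points above, version pinning and feature flags, can be sketched in Python. This is a minimal illustration under stated assumptions: the pin check only recognises the simple `name==version` form used in pip-style requirement lines, and the `FeatureFlags` class and the `new_payment_flow` flag are hypothetical names of my own, not part of any library.

```python
import re

# An exactly pinned requirement looks like "name==1.2.3".
PINNED = re.compile(r"^[A-Za-z0-9._-]+==[A-Za-z0-9._+!-]+$")


def unpinned(requirement_lines):
    """Return the requirement lines that do not fix an exact version."""
    result = []
    for line in requirement_lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not PINNED.match(line):
            result.append(line)
    return result


class FeatureFlags:
    """In-memory feature flags; unknown flags default to off."""

    def __init__(self, initial=None):
        self._flags = dict(initial or {})

    def enable(self, name):
        self._flags[name] = True

    def disable(self, name):
        self._flags[name] = False

    def is_enabled(self, name):
        return self._flags.get(name, False)


def checkout(flags):
    # The new path can be switched off without a redeploy if it misbehaves.
    if flags.is_enabled("new_payment_flow"):
        return "new payment flow"
    return "legacy payment flow"


print(unpinned(["requests==2.31.0", "flask>=2.0", "numpy  # latest"]))
```

In a real project the flags would usually live in configuration or a database rather than in memory, so that they can be changed while the system is running.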

Hardware

While designing, try to select parts which have multiple drop-in replacements available, preferably from separate vendors.
Make your design modular, so that if there is a problem with one of the parts, it’s easier to isolate and only that section needs to be redesigned.
Ensure lifetime support for the parts. Make sure the part you’re using has a strong supply chain and will be supported by the vendor and manufacturer for the life of your product.
Build your design with software tools (CAD programs/simulators) whose projects can be exported to another tool easily. Make sure you at least have a read-only version of your projects even if the software license expires.

Supply Chain

Find a vendor, and their replacement, and their replacement.
Find the manufacturer, and their replacement, and their replacement.
Make sure to diversify these enough across region, quality and cost. If all your vendors are in the same geographic location, a single natural disaster or political event can affect everything beyond recovery.
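The checks above can be automated once the sourcing data is written down. Here is a minimal Python sketch; the parts, vendors and regions are made-up sample data, and the function name is hypothetical.

```python
def sourcing_risks(sourcing):
    """Flag parts that rely on a single vendor or a single region.

    `sourcing` maps a part name to a list of (vendor, region) pairs.
    """
    report = {}
    for part, vendors in sourcing.items():
        names = {vendor for vendor, _ in vendors}
        regions = {region for _, region in vendors}
        issues = []
        if len(names) < 2:
            issues.append("single vendor")
        if len(regions) < 2:
            issues.append("single region")
        if issues:
            report[part] = issues
    return report


# Made-up sample data for illustration only.
SOURCING = {
    "display": [("Vendor A", "Region 1"), ("Vendor B", "Region 2")],
    "battery": [("Vendor C", "Region 1"), ("Vendor D", "Region 1")],
    "chipset": [("Vendor E", "Region 3")],
}

for part, issues in sourcing_risks(SOURCING).items():
    print(part, "->", ", ".join(issues))
```

The same structure works for hardware parts with drop-in replacements: each alternative part is just another entry in the list for that slot.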

People

People and workforce are as important to your product as your tech and designs. Make sure procedures are well documented so that if any one person is unavailable, the work can still be carried out by others.
Avoid process choke points. Be it purchase approvals, design releases or customer communication, if these rely heavily on a single person, their absence results in deliverables being choked or schedules slipping. It’s better to rely on a process and a team than on a single individual, no matter how good they are.
Keep employees happy. It’s good to have backups; it’s better to not need them. No matter how well you document, each profession has intricacies which are ingrained in a person who has done the job for years. Replacing that individual will always result in a loss of information and quality.

Personal life

As in professional life, build a dependency graph periodically, maybe once every six months or a year, to analyse whether you’re depending too much on something.
Reduce your dependency on tech. It’s good to enjoy tech; it’s bad to depend on it. Have life skills that can be used without tech. Some of them could be cooking without electricity, washing without machines, being comfortable with heat and cold without HVAC systems, or navigating without mobile maps.
Have distributed and local backups: a small generator or power storage for when the grid fails, cash for when payment systems fail, stored water for when pipelines burst, bicycles for when there is a fuel shortage.
Financial safety. Don’t rely exclusively on a single source of income, be it a job, a business or an investment. Diversify your portfolio and have backups for when systems collapse.
Emotional safety. Friends and family are an immense part of anyone’s life. But relying too heavily on them can be a mistake, be it your friends, girlfriend, spouse or parents. Mishaps can happen anytime: there might be an accident and you might not see them tomorrow, or there might be a fight and you might not end up on good terms. Whatever it is, cherish those relationships while they last, but be prepared and comfortable with your own company for when things go wrong.

In conclusion, I’d like to modify a quote by Mel Brooks. He said, “Hope for the best, expect the worst” which I would reword just a bit to “Hope for the best, prepare for the worst”.
