From CrowdStrike to Boeing: Escalating challenges of third-party and supply chain risk

Nothing gets people prioritising third-party risk quite like a major IT vendor outage. We’ve witnessed some pretty large-scale incidents in recent years: the SolarWinds cyber attack, which began in 2019 and impacted 18,000 users; several Amazon Web Services (AWS) outages that cost companies in excess of US$150m; and countless data breaches. But the CrowdStrike software update failure of July 2024 is now widely considered to be the largest IT outage in history. In our latest free report, we explore significant third-party risk events, such as those involving CrowdStrike, examine the intricate supply chain challenges faced by Boeing, and provide insights into managing risks within a progressively complex network of suppliers and partners.


What went wrong with CrowdStrike?

On 19th July, 2024, a software update issued by IT security firm CrowdStrike caused widespread major functionality problems for IT users around the world. CrowdStrike had released an update for its Falcon sensor software, a critical endpoint (IT device) security solution. Shortly after the update was deployed, users began reporting major problems with their computer systems.

Who was affected?

The outage affected thousands of organisations around the globe that rely on Microsoft operating systems protected by CrowdStrike software. This included hospitals, doctors’ surgeries, retailers, airports, airlines, rail operators, news and media outlets and financial institutions. Many of these organisations had no access at all to their computer systems, meaning no access to vital information such as medical records and essential processes such as paying wages. According to Microsoft’s estimates, the incident affected more than 8.5 million devices around the globe. According to Parametrix, which specialises in insuring IT cloud outages, only around 10-20% of the losses incurred due to the outage will actually be covered by insurance policies.

How did they fix it?

In the immediate aftermath, CrowdStrike recommended that affected users roll back to a previous stable version of the software while a fix was being developed. Within a few days, the firm released a patch to address the performance and stability issues. In a statement, CrowdStrike CEO George Kurtz said: “This is not a security incident or cyber attack. The issue has been identified, isolated, and a fix has been deployed.” The issue only impacted Microsoft-based users of CrowdStrike – other operating systems such as Mac were not affected by the outage.

Compatibility issues

The update to CrowdStrike’s Falcon sensor exposed compatibility issues with certain versions of Microsoft Windows. This led to functionality problems with devices, including high CPU (central processing unit) usage and system instability. Many users were met with a blank blue screen when they tried to log in on 19th July. High CPU usage is linked to long loading times and can cause the computer to repeatedly crash. It is usually caused when a computer has to work too hard, perhaps due to running too many apps at one time, or running a very high-intensity app.

Damage control

CrowdStrike and Microsoft had to collaborate closely to diagnose the root cause of the problem and develop a patch to resolve the compatibility issues quickly. Both firms provided guidance to their mutual customers on how to mitigate the issues temporarily, such as rolling back to a previous version of the Falcon sensor, or applying specific Windows updates until the patch was available.

Were financial institutions affected?

A number of financial institutions reported issues due to the outage, but the impact was fairly minimal. Charles Schwab posted on its website on the first day of the outage: “Due to a third-party, global, industry-wide issue, certain online functionality may be intermittently slow or unavailable. We’re actively monitoring the issue. Phone services may be disrupted and hold times may be longer than usual.” Other banks reportedly impacted by the outage included Wells Fargo, TD Bank, Barclays and Metro Bank. Barclays said in the immediate aftermath that all of its services were “operating as normal at this time other than our digital investing platform Smart Investor, where customers are currently unable to manage their account in the app, Online Banking or over the phone.” Payments systems provided by Visa were also affected, with many supermarkets and other retailers unable to take card payments for an entire day. Other financial services firms invoked their crisis management committees to investigate, monitor and address the issue, resulting in many thousands of management hours being lost.

Was the CrowdStrike outage preventable?

A full root-cause analysis into the CrowdStrike incident is still underway, but the general consensus of IT experts discussing the incident on internet forums seems to be that it was an inevitable outcome of a digital monoculture, with many saying it could just as easily have happened with a Microsoft update.

An interesting point was raised in one Reddit thread discussing whether the right people are responsible for decision making when it comes to selecting IT/cloud vendors. The poster, who appeared to be a computer programmer, argued that the world has found itself in this predicament largely due to people-pleasing in the C-suite at major corporations. The poster argued that senior executives want a name they can recognise when discussing cyber security vendors – someone they can easily Google to find a positive review. They are reluctant to challenge the status quo, or to invest into protecting the firm from a threat that is unlikely to occur – however great the potential impact may be.

The interesting thing about the CrowdStrike incident is that it was so far reaching, but wasn’t created by a malicious attack. The consequences would have been far worse had it been caused by cyber criminals, and the incident has highlighted just how vulnerable cloud-based systems are. Selecting a trusted supplier with a good track record and millions of users may not be the go-to approach in future. Global credit rating agency DBRS said the outage may “raise regulatory questions about the oligopolistic nature of critical IT infrastructure globally and could impact the critical software industry landscape over the long term.”

Legal action

One notable consequence of the CrowdStrike incident has been threats from the CEO of Delta Air Lines, Ed Bastian, to sue CrowdStrike (or failing that, to sue Microsoft) for losses which the carrier experienced because of the outage, currently estimated at US$500m. Delta is itself facing a class action lawsuit from customers whose flights were cancelled. CrowdStrike responded to Delta saying it was not responsible for Delta’s problems, resulting in back-and-forth claims, threats and rejections. Time will tell what the full consequences will be and who will be held responsible for losses.

Regulation: tackling IT concentration risk

DORA

IT concentration risk has been a concern for regulators for some time and is being addressed in upcoming IT resilience legislation in Europe. The Digital Operational Resilience Act (DORA) is EU financial regulation due to be implemented in January 2025. It will require financial services firms to evaluate their own internal IT concentration risks before entering into IT contracts. Specifically, DORA emphasises the importance of understanding how subcontracting can impact concentration risk: for example, a business might have two suppliers offering similar services, but if both depend on the same cloud provider, there could be an unnoticed single point of failure.
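The hidden single point of failure described above can be sketched in a few lines of Python. This is a minimal illustration only, not a method prescribed by DORA: the supplier names, the `graph` mapping and both function names are hypothetical, and a real exercise would depend on subcontracting data disclosed by each vendor.

```python
from collections import defaultdict


def transitive_dependencies(graph, vendor):
    """Return every upstream party reachable from `vendor`."""
    seen = set()
    stack = list(graph.get(vendor, ()))
    while stack:
        party = stack.pop()
        if party not in seen:
            seen.add(party)
            stack.extend(graph.get(party, ()))
    return seen


def shared_dependencies(graph, direct_vendors):
    """Map each upstream party to the direct vendors relying on it,
    keeping only parties that two or more direct vendors share."""
    relied_on_by = defaultdict(set)
    for vendor in direct_vendors:
        for party in transitive_dependencies(graph, vendor):
            relied_on_by[party].add(vendor)
    return {
        party: sorted(vendors)
        for party, vendors in relied_on_by.items()
        if len(vendors) > 1
    }


# Hypothetical data: two payment vendors that look independent,
# but both ultimately run on the same cloud provider.
graph = {
    "PaymentsVendorA": ["CloudX"],
    "PaymentsVendorB": ["HostingCo"],
    "HostingCo": ["CloudX"],
}

result = shared_dependencies(graph, ["PaymentsVendorA", "PaymentsVendorB"])
print(result)
```

In this made-up example, the second vendor reaches the shared cloud provider only through an intermediary, which is exactly the kind of dependency that a review limited to direct contracts would miss.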

Additionally, DORA empowers the European Supervisory Authorities (ESAs) to designate certain IT service providers as “critical ICT third-party service providers” and establishes an oversight framework for them. Various factors will be considered, such as the potential impact of a provider’s system outage on the financial system, particularly given the number of financial entities that rely on them. While the legislation primarily targets major cloud vendors, the CrowdStrike incident illustrates that concentration on a few software vendors for on-premises and endpoint solutions can pose risks similar to those associated with reliance on cloud “hyperscalers” such as AWS.

US regulation

The United States does not currently have a direct equivalent to DORA which specifically targets the operational resilience of the financial sector’s information and communication technology systems. However, it has several regulations and guidelines that cover similar aspects of cybersecurity, operational resilience, and risk management in the financial sector, including the Federal Reserve’s guidance on operational resilience, which emphasises the importance of maintaining critical operations during disruptions, including cyberattacks. The OCC and the FDIC have issued similar guidelines that stress the importance of ICT risk management, incident response, and recovery.

The US also has the 2015 Cybersecurity Information Sharing Act (CISA), which encourages information sharing on cybersecurity threats between the government and private sector, as well as the National Institute of Standards and Technology (NIST) Cybersecurity Framework, which provides a policy framework for private sector companies to assess and improve their ability to prevent, detect, and respond to cyberattacks.

US guidance on third-party risk

In June 2023, a joint-guidance document on handling third-party risk was published by the Federal Reserve, OCC and the FDIC. The guidance covers risks associated with a bank’s outsourced services, including independent consultants, referral arrangements, fintech partnerships, merchant payment services, any services provided by subsidiaries and affiliates and joint ventures. “The final guidance applies to all banking organisations, regardless of an organisation’s size and complexity and banking organisations should prepare for supervisory reviews focused on third-party risk management,” says law firm Covington and Burling. “The agencies enhanced the discussion in the final guidance’s closing section on their supervisory review practices for third-party relationships. For example, the final guidance states that examiners typically perform transaction testing or review results of testing to evaluate the third party’s activities and compliance with applicable laws and regulations. With this express reference, it seems fair to expect greater reliance on transaction testing as part of the third-party risk management examination processes.”

In other words, US regulators are upping their game in this area of oversight and after a global IT outage, it is only sensible to assume other jurisdictions will also shift focus.

The Boeing safety crisis: when complicated supply chains go wrong

If the CrowdStrike incident provides a good example of how digital monopolies or monocultures can oversimplify the supply chain and create wide-reaching, if relatively low-impact, risk events, then the problems Boeing has experienced with its 737 Max airliner are an example of the other end of the scale.

The story begins back in 2004 when, in order to cut costs and out-do its main competitor (Airbus), Boeing decided to outsource 70% of its design, engineering and manufacturing of entire modules to 50 third-party suppliers. Those suppliers would then collaborate using virtual design software known as PLM (product lifecycle management) software – a “Global Collaboration Environment” created in partnership with Dassault Systèmes. The idea was that the software would “make it easy for people around the globe to work together in real time.”

Fast forward to 2018 and serious problems with the 737 Max were becoming apparent. On 29th October, just minutes after taking off from Jakarta airport, Lion Air flight JT 610 – a 737 Max 8 aircraft – crashed into the Java Sea, killing all 189 people on board. Then just five months later, in March 2019, another 737 Max 8 aircraft, this time an Ethiopian Airlines flight, crashed shortly after take-off, killing all 157 people on board.

In both crashes, investigations found that the pilots had tried unsuccessfully to override automated stabilisation software, which was forcing the aircraft into a nosedive position. No matter what they tried in order to manually override the system, nothing worked and both planes crashed.

In January 2024, another safety incident occurred when the middle exit door on the left side of a Boeing 737 Max 9 blew off at 16,000 feet during an Alaska Airlines flight. Luckily, no one was killed, but if passengers had been sitting in the seats next to the door (they were vacant) they would almost certainly have been sucked out of the aircraft as the cabin rapidly decompressed. Missing bolts were identified as the likely cause of this incident.

People, processes and technology

The problems at Boeing appear to be both operational and cultural. Santiago Paredes, a former worker at Spirit AeroSystems, which is a supplier of fuselages to Boeing, recently went public with his experience of trying to raise safety concerns internally while he was working at the firm. He alleges he was nicknamed “showstopper” for slowing production by highlighting defects on parts that were being prepared for shipping to Boeing.

“I felt I was being threatened, and I felt I was being retaliated against for raising concerns,” he told the BBC. Paredes claims he was then ordered by management to change the way in which defects were reported, in order to reduce the number of defects on record. “They just wanted the product shipped out,” said Paredes. “They weren’t focused on the consequences of shipping bad fuselages. They were just focused on meeting the quotas, meeting the schedule, meeting the budget… If the numbers looked good, the state of the fuselages didn’t really matter.” Spirit AeroSystems has now been bought out by Boeing and the two will merge in 2025. It was originally part of Boeing, until it was spun off in 2005.

John Barnett, another Boeing whistleblower, who worked as a quality-control manager for almost three decades before retiring in 2017, raised concerns about safety issues he felt weren’t being dealt with properly by the company. This included metal shavings found near to the wiring for flight controls which could potentially have cut through wires during flight and caused a crash. He claimed that management ignored his concerns and moved him to another area of the plant instead of addressing them. Barnett died by suicide in March 2024, at which time he had been providing evidence for his whistleblower case.

“John was deeply concerned about the safety of the aircraft and flying public, and had identified some serious defects that he felt were not adequately addressed,” Barnett’s brother, Rodney, said in a family statement shortly after his death. “He said that Boeing had a culture of concealment and was putting profits over safety.”

Another whistleblower, Joshua Dean, who was a quality auditor at Spirit AeroSystems, filed a complaint alleging “serious and gross misconduct by senior quality management of the 737 production line” at Spirit. He died in May 2024 after developing breathing problems which led to pneumonia. A total of 32 people working at Boeing complained to regulators that they experienced retaliation from management when they flagged safety issues.

From an operational point of view, Boeing had made a common mistake in assuming that software alone would be the magic ingredient to simplify a complicated supply chain. “CIOs commonly say that implementing technology is not enough,” says Steve Banker of SupplyChain Services, in an article for Forbes. “Companies need to pay attention to people, processes and technology. There is growing certainty that Boeing is not paying enough attention to their own and their partners’ quality culture and processes.”

Profit over safety

Many point to the merger between Boeing and McDonnell Douglas in 1997 as the real catalyst for cultural decline at the firm. The 2022 Netflix documentary Downfall: The Case Against Boeing explores the argument that Boeing began to prioritise profit over safety after the CEO of McDonnell Douglas, Harry Stonecipher, took the helm. His share value-focussed approach had a huge impact on the Boeing supply chain. The number of workers was cut drastically, so there was increasing pressure on engineers to build more planes in less time. Suppliers were also put under pressure to speed up delivery and cut corners.

Several internal emails and messages were made public during the investigation into Boeing and reveal just how obvious the problems were to employees – even before the fatal crashes took place. “This airplane is designed by clowns, who in turn are supervised by monkeys,” said one employee in April 2017. “They expect to do only two sets of one weeks airplane testing!!! Normally the FMC [flight management computer] is tested during an entire flight test program … Jesus, it’s doomed. I said we must do much more than that.”

Shifting blame

Boeing also launched a PR campaign following the crashes in an attempt to shift blame from itself onto the pilots and the airlines, claiming that the pilots did not respond to the software issues they experienced in the way they were expected to. Boeing even insinuated that American pilots would have reacted differently, suggesting they made mistakes because they were foreign. David Calhoun, CEO of Boeing, told the New York Times in 2020 that pilots from Indonesia and Ethiopia “don’t have anywhere near the experience that they have here in the US.”

The investigation into Boeing’s actions in the lead-up to the crashes in Ethiopia and Indonesia revealed that Lion Air (the carrier for the Jakarta crash) had contacted Boeing asking for more training for its pilots for the new 737 Max aircraft, but was denied – and even called “idiots” for asking. It later emerged that the pilots were in fact not even made aware of the existence of the new software system installed on 737 Max aircraft that caused the planes to nosedive (the software was known as MCAS – Manoeuvring Characteristics Augmentation System) because Boeing was trying to keep it hidden from the FAA (Federal Aviation Administration) to avoid regulatory measures and additional costs. Boeing knew that if the 737 Max appeared to differ too much from its predecessor, it would have to go through the FAA approval process, costing the company and its shareholders money and eating into profits and executive bonuses.

At the time, Boeing claimed it did not share details of MCAS because it didn’t want to “overwhelm” pilots with information. The airlines, pilots and passengers had put their trust in Boeing’s century-long reputation for quality and integrity, but were grossly misled – with catastrophic consequences.

Boeing’s supply chain: a closer look

Several key issues in Boeing’s supply chain contributed to safety risks in the years leading up to the fatal crashes:

Outsourcing, fragmentation and quality control: Boeing’s decision to outsource significant portions of its supply chain, including critical components and software development, led to fragmented control over quality and safety. The reliance on numerous suppliers, some of whom were not traditionally involved in aerospace, meant that Boeing had less oversight and control over the design and production process. Below are some examples of the suppliers used by Boeing:

HCL Technologies and Cyient (formerly Infotech Enterprises): These two India-based companies were subcontracted by Boeing for software development, including coding tasks for the MCAS system on the 737 Max. While both companies had experience in software development, their expertise was more aligned with general IT services and engineering than with the highly specialised field of aerospace software engineering. This raised concerns about whether they had the deep domain knowledge required for such critical systems, and generated headlines such as “Boeing’s 737 Max Software Outsourced to $9-an-Hour Engineers.” According to one former Boeing employee, Mark Rabin, using these less-experienced coders involved a lot of “going back and forth” to correct faulty code. Rabin also claimed Boeing fired several senior engineers at this time because it felt the firm’s products were mature enough that it no longer needed their expertise.

Spirit AeroSystems: Although Spirit AeroSystems is a major aerospace supplier today, its roots are in the manufacturing of components for various industries, not exclusively aerospace. Spirit was responsible for producing significant portions of the 737 Max, including the fuselage and other structures. The rapid scale-up of production at Spirit, driven by Boeing’s demand for faster deliveries, led to serious quality control issues at Spirit’s manufacturing plant, highlighted by at least two whistleblowers.

Hexcel and Toray Industries: These companies supplied advanced composite materials for Boeing aircraft. While Hexcel and Toray are leaders in composite materials, their involvement highlights how Boeing was increasingly sourcing critical materials from non-traditional suppliers (in the sense of not being solely aerospace-focused) to leverage the latest materials technology, which could potentially raise new risks to quality control.

Latécoère: French company Latécoère, traditionally involved in the manufacturing of aircraft doors and other aerostructures, was tasked with producing electrical wiring systems for the 737 Max. The wiring issues in the 737 Max were part of the broader concerns about how Boeing integrated the work of these various suppliers.

Communication issues

The complexity of Boeing’s supply chain led to communication gaps between Boeing and its suppliers. These gaps sometimes resulted in critical information about safety issues not being effectively communicated or addressed. One example is the (deliberate) lack of communication with airlines, pilots and regulators about the MCAS system, which ultimately resulted in two fatal crashes.

Overall, Boeing’s supply chain practices, driven by cost-cutting and efficiency goals, created vulnerabilities that had serious safety implications. The lack of rigorous oversight, fragmented communication, and quality control issues within the supply chain played a significant role in compromising the safety of Boeing’s aircraft.

PLM system

The PLM system used by Boeing and provided by Dassault was more complex and less effective than anticipated, contributing to delays and quality issues. The PLM system was supposed to streamline data management across different teams and suppliers. However, problems with data accuracy and accessibility led to miscommunication and inefficiencies. Design changes and updates were sometimes not communicated effectively to all stakeholders, causing delays and errors.

In a blog discussing the problems experienced at Boeing, James White, VP of strategy at PLM provider Duro (not Boeing’s supplier), said: “Today, OEMs [original equipment manufacturers], like Boeing, play the role of systems integrator, bringing together all the different elements of the plane from various supply chain companies. However, Boeing is still responsible for getting this right. Manufacturing companies use advanced software solutions like PLM to manage collaboration with the global supplier network and enforce governance standards such as quality, change management and full traceability. Issues arise when changes occur, and suppliers are switched in or out without consultation or awareness by the OEM. The Boeing and Spirit AeroSystems contract stipulates requirements and acceptance criteria, but how Spirit AeroSystems achieves that is primarily their business, meaning Boeing doesn’t always have complete visibility into supply decisions.”

A PLM system, argues White, can only do so much in reducing supply chain complexity. “Boeing likely has system-to-system connections to Spirit AeroSystems as well as strict contractual management. However, it’s less likely that this extends to Spirit AeroSystems’ subsidiaries and suppliers. It’s not plausible for Boeing, or any other OEM, to maintain full visibility and control throughout the entire supply chain.”

The consequences

In July 2024, Boeing pleaded guilty to a criminal fraud conspiracy charge for the way it misled regulators and customers about the MCAS system and other safety issues in its supply chain. It was ordered to pay a US$243.6m criminal penalty and US$500m to a fund for the victims’ families. It is still unclear how the guilty plea will impact the firm’s government contracts in the US, which are not usually granted to companies with a criminal record. However, Boeing is considered by most as a company that is “too big to fail,” employing more than 170,000 people globally and generating revenues of nearly US$78bn in 2023 – so it is likely its criminal record will be overlooked.

The victims’ families felt the plea deal allowed Boeing to escape true punishment for its crimes. “What emerged from the negotiations was a plea agreement treating Boeing’s deadly crime as another run-of-the-mill corporate compliance problem,” they said in a statement. No prison sentences were handed out to Boeing executives.

Third-party risk and business continuity planning

Taking the example of the CrowdStrike outage again, the biggest and most immediate impact for affected businesses was business continuity. As many users couldn’t access their systems at all, it brought all activity to a halt. The fix also couldn’t be administered remotely – it had to be done manually by a technician, and to every individual device affected. This made putting the problem right extremely time-consuming and personnel heavy.

“Having a robust business continuity plan in place is probably the only real solution to protecting the firm from third party risks,” says Mike Finlay, CEO of RiskBusiness. “Events such as the CrowdStrike incident, or a cyber attack, or the demise of a major supplier or third-party service provider are rare, but inevitable. Having a plan B in place to limit disruption and allow continuity of service to your clients is the best protection because the supply chain is so complex and the problems that occur further up the chain can no more be prevented than they can be predicted. We know that relying on one supplier for an IT or cloud service has its risks. But we also need to delve into the vendors of vendors in order to see how their supply chains could be impacted. Business continuity risks are increasingly materialising from third-party suppliers and firms need to ensure that they have their own alternative plans in place. It’s also important to remember that not all vendor risk is associated with IT risk and that less tangible threats, such as the toxic, profit-driven corporate culture seen at Boeing, can be even more damaging – and even fatal.”

Due diligence: not just a box-ticking exercise

Banks and businesses must prioritise rigorous due diligence not only on their direct third-party vendors but also on fourth and nth parties (any party beyond the third member) within the supply chain, ensuring that all entities involved meet the necessary compliance and risk management standards. “Identifying all your fourth party risks is onerous enough,” says Richard Frykberg, CEO of IQ Business Solutions in a recent blog. “Attempting to assess all nth party risks, which by the power of geometric progression may be enormous, is practically impossible. Which comes back to primary vendor risk management. You need to be confident that your suppliers have themselves adopted rigorous vendor risk management processes. You need to make sure that their security posture is independently certified.” Requesting globally recognised accreditation such as SOC 2 or ISO 27001 is one way of ensuring your vendors are diligent. It’s important to continue to reassess nth party risk on a regular basis because although you may know who your own vendors are at all times, do you know when they are switching vendors, or merging with another company? How might this impact your operations?
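To put rough numbers on the “geometric progression” Frykberg describes, the short Python sketch below counts parties tier by tier. It rests on simplifying assumptions that are not from the article: every party is taken to have the same number of suppliers (a hypothetical fan-out of 10) and no supplier is shared between parties, so the figures are an upper-bound illustration rather than an estimate for any real firm.

```python
def parties_per_tier(fan_out, depth):
    """Number of parties at each tier, where tier 1 is the firm's direct
    vendors and every party is assumed to have `fan_out` suppliers."""
    return [fan_out ** tier for tier in range(1, depth + 1)]


# Hypothetical firm: 10 direct vendors, each with 10 suppliers of its own,
# followed down three tiers.
tiers = parties_per_tier(fan_out=10, depth=3)
total = sum(tiers)

print(tiers)  # [10, 100, 1000]
print(total)  # 1110
```

Under these assumptions, going just two tiers beyond the direct vendors turns 10 relationships into more than 1,000, which is why the quote steers attention back to certifying the primary vendors’ own risk management.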

“Firms must incorporate detailed terms and conditions regarding sub-contracting practices and mandate transparency around the disclosure of all supply chain participants within contractual agreements,” says Finlay. “Service level agreements (SLAs) must include specific terms that hold primary vendors accountable for implementing and maintaining stringent quality controls, as well as mechanisms for monitoring and verifying these controls across the entire service chain, with the hope of mitigating operational, regulatory and reputational risks.”

Leveraging AI

Supply chain complexities require firms to have a detailed and wide-reaching view of all interconnected industries in order to act on events within their network before they impact business, adds Finlay. “With the increasing use of artificial intelligence (AI) technologies, forward-looking supply chain management can now start to actively search for, detect and then monitor previously unknown relationships between a firm’s suppliers and the suppliers of those suppliers. By incorporating news service monitoring, a firm can start to proactively understand potential challenges arising in its supply chain when mentions are made in the news affecting a supplier to a supplier, then initiate remedial actions before the problem even starts to affect them. In this way, the procurement and vendor relationship management teams are now firmly part of the firm’s risk management infrastructure.”
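As a very rough sketch of the news-monitoring idea, the Python below flags headlines that mention an upstream supplier and lists which direct vendors are exposed. It uses plain keyword matching as a stand-in for the AI techniques Finlay mentions, and every name, headline and function here is hypothetical.

```python
def flag_headlines(headlines, exposure):
    """Return (headline, upstream party, exposed direct vendors) for every
    headline that mentions a known upstream party by name."""
    alerts = []
    for headline in headlines:
        lowered = headline.lower()
        for party, vendors in exposure.items():
            if party.lower() in lowered:
                alerts.append((headline, party, vendors))
    return alerts


# Hypothetical mapping of upstream suppliers to the direct vendors
# that depend on them.
exposure = {
    "CloudX": ["PaymentsVendorA", "PaymentsVendorB"],
    "ChipMaker": ["HardwareVendor"],
}

# Hypothetical headlines from a news feed.
headlines = [
    "CloudX reports regional outage",
    "Markets close higher after quiet session",
]

alerts = flag_headlines(headlines, exposure)
for headline, party, vendors in alerts:
    print(f"{party} mentioned in '{headline}': check {', '.join(vendors)}")
```

A production system would need entity resolution and relevance scoring rather than substring matching, but the shape is the same: news about a supplier’s supplier is routed to the team that owns the affected vendor relationship.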
