Let's blame the dev who pressed "Deploy" - by Dmitry Kudryavtsev

Mac ( @mac@programming.dev ) · edit-2 5 months ago

Let's blame the dev who pressed "Deploy" - by Dmitry Kudryavtsev

Kissaki ( @Kissaki@programming.dev ) · 5 months ago

CrowdStrike ToS, section 8.6 Disclaimer

[…] THE OFFERINGS AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT AND ARE NOT DESIGNED OR INTENDED FOR USE IN ANY HAZARDOUS ENVIRONMENT REQUIRING FAIL-SAFE PERFORMANCE OR OPERATION. NEITHER THE OFFERINGS NOR CROWDSTRIKE TOOLS ARE FOR USE IN THE OPERATION OF AIRCRAFT NAVIGATION, NUCLEAR FACILITIES, COMMUNICATION SYSTEMS, WEAPONS SYSTEMS, DIRECT OR INDIRECT LIFE-SUPPORT SYSTEMS, AIR TRAFFIC CONTROL, OR ANY APPLICATION OR INSTALLATION WHERE FAILURE COULD RESULT IN DEATH, SEVERE PHYSICAL INJURY, OR PROPERTY DAMAGE. […]

It’s about safety, but it’s truly ironic how it mentions aircraft-related uses twice, and communication systems (very broad).

It certainly doesn’t inspire confidence in the overall stability. But it’s also general ToS-speak, and may only be noteworthy now, after the fact.

goferking0 ( @goferking0@lemmy.sdf.org ) · 5 months ago

Weren’t the issues at airports because of the ticketing and scheduling systems going down, not anything with aircraft?

Kissaki ( @Kissaki@programming.dev ) · 5 months ago

Yes, I think so.

lad ( @sukhmel@programming.dev ) · 5 months ago

That’s just covering up, like a disclaimer that your software is intended to only be used on the 29ᵗʰ of February. You don’t expect anyone to follow that rule, but you expect the court to rule that the user is at fault.

Luckily, it doesn’t always work that way, but we will see how it turns out this time.

edit-2 5 months ago

Lawful Masses with Leonard French covered this yesterday. He is a copyright attorney. He starts the video with the opinion that the ToS wouldn’t protect CrowdStrike.

ByteOnBikes ( @ByteOnBikes@slrpnk.net ) · 5 months ago

It’s never a single person who caused a failure.

jonne ( @jonne@infosec.pub ) · 5 months ago

Yeah, exactly. You’d think they’d have a test suite before pushing an update, or do a staggered rollout where they only push it to a sample of machines first. Just blaming one guy because you had an inadequate UAT process is ridiculous.

BB_C ( @BB_C@programming.dev ) · 5 months ago

Yesterday I was browsing /r/programming

:tabclose

polle ( @polle@feddit.org ) · 5 months ago

Microsoft also started blaming the EU. It’s such a shitshow, it’s ridiculous.

https://www.tomshardware.com/software/windows/microsofts-eu-agreement-means-it-will-be-hard-to-avoid-crowdstrike-like-calamities-in-the-future

luciole (he/him) ( @luciole@beehaw.org ) · 5 months ago

That is a lot of bile even for a rant. Agreed that it’s nonsensical to blame the dev though. This is software; human error should not be enough to cause such massive damage. Real question is: what’s wrong with the test suites? Did someone consciously decide the team would skimp on them?

As for blame, if we take the word of CrowdStrike’s CEO then there is no individual negligence nor malice involved. Therefore it is the company’s responsibility as a whole, plain and simple.

thingsiplay ( @thingsiplay@beehaw.org ) · 5 months ago

Real question is: what’s wrong with the test suites?

This is what I’m asking myself too. If they had tested it, and they should have, then this massive error would not have happened: a) controlled test suites and machines in their labs, b) at least one test machine connected through the internet and acting like a customer, tested by a real human, c) update in waves throughout the day. They can’t tell me that they did all 3 of these steps. -meme
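To make point c) concrete, here is a minimal sketch of how machines could be split into update waves. It assumes each machine has a stable string ID; `wave_for`, `machines_in_wave`, and the default of 4 waves are made-up names and values for illustration, not anything CrowdStrike is known to use.

```python
import hashlib


def wave_for(machine_id: str, num_waves: int = 4) -> int:
    """Map a machine to a rollout wave (0 = first) using a stable hash.

    The same machine ID always lands in the same wave, so the split
    does not change between runs or between servers.
    """
    digest = hashlib.sha256(machine_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_waves


def machines_in_wave(machine_ids, wave, num_waves=4):
    """Return the machines that should receive the update in the given wave."""
    return [m for m in machine_ids if wave_for(m, num_waves) == wave]
```

Wave 0 would get the update first, and later waves would only be released once the earlier ones look healthy.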

Hector_McG ( @Hector_McG@programming.dev ) · 5 months ago

Therefore it is the company’s responsibility as a whole.

The governance of the company as a whole is the CEO’s responsibility. Thus a company-wide failure is 100% the CEO’s fault.

If the CEO does not resign over this, the governance of the company will not change significantly, and it will happen again.

Umbrias ( @Umbrias@beehaw.org ) · 5 months ago

I don’t know enough about the CrowdStrike stuff in particular to have much of an opinion on it, but I will say that software devs/engineers have long skirted by without any of the accountability present in other engineering fields. If software engineers want to be called engineers, and they should, then this may be an excellent opportunity to introduce accountability associations and ethics requirements which prevent or reduce systemic company issues and empower software engineers to enforce good practices.

edit-2 5 months ago

Crowdstrike CEO should go to jail. The corporation should get the death sentence.

Edit: For the downvoters, they for real negligently designed a system that killed people when it failed. The CEO, as an officer of the company, holds liability. If corporations want rights like people, then when they are grossly negligent they should be punished. We can’t put them in jail, so they should be forced to divest their assets and be “killed.” This doesn’t even sound radical to me; this sounds like a basic safeguard against corporate overreach.

Mubelotix ( @Mubelotix@jlai.lu ) · 5 months ago

I blame the users for using that software in the first place

Kissaki ( @Kissaki@programming.dev ) · edit-2 5 months ago

It’s a systemic, multi-layered problem.

The simplest, least-effort thing that could have prevented the scale of the issues is not installing updates automatically, but waiting four days and triggering the install afterwards if no issues have surfaced.
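A rough sketch of that hold-period idea, assuming the client knows when an update was released and whether it has been flagged as bad in the meantime. `should_install`, `HOLD_PERIOD`, and the `known_bad` flag are hypothetical names for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical hold period; four days is just the figure from the comment above.
HOLD_PERIOD = timedelta(days=4)


def should_install(released_at, known_bad, now=None, hold=HOLD_PERIOD):
    """Install only if the update has aged past the hold period without being flagged."""
    if now is None:
        now = datetime.now(timezone.utc)
    if known_bad:
        # Someone else already hit a problem: never install this update.
        return False
    return now - released_at >= hold
```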

Automatically forwarding updates is also forwarding risk. The higher the impact area, the more worthwhile safeguards are.

Testing/staging or partial successive rollouts could have also mitigated a large number of issues, but require more investment.
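As a sketch of what a partial successive rollout could look like: deploy to a growing fraction of machines and halt as soon as too many of them report unhealthy. The stage fractions, the 2% threshold, and the `deploy` / `is_healthy` callbacks are assumptions for illustration, not a description of any vendor’s pipeline.

```python
from typing import Callable, Sequence


def staged_rollout(
    machines: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stages: Sequence[float] = (0.01, 0.10, 1.0),
    max_failure_rate: float = 0.02,
) -> int:
    """Deploy in growing stages; return how many machines received the update."""
    updated = 0
    for fraction in stages:
        # Cumulative number of machines that should be updated after this stage.
        target = min(len(machines), max(1, round(len(machines) * fraction)))
        batch = machines[updated:target]
        if not batch:
            continue
        for machine in batch:
            deploy(machine)
        updated = target
        failures = sum(1 for machine in batch if not is_healthy(machine))
        if failures / len(batch) > max_failure_rate:
            break  # halt the rollout; later stages never get the update
    return updated
```

With 200 machines and an update that breaks every host, this stops after the first 2 machines instead of reaching all 200.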

wizardbeard ( @wizardbeard@lemmy.dbzer0.com ) · 5 months ago

The update that crashed things was an anti-malware definitions update. CrowdStrike offers no way to delay or stage them (they are downloaded automatically as soon as they are available), and there’s good reason for not wanting to delay definition updates, as it leaves you vulnerable to known malware for longer.

Kissaki ( @Kissaki@programming.dev ) · 5 months ago

How does a definitions update crash Windows with a BSOD?

Gestrid ( @Gestrid@lemmy.ca ) · 5 months ago

Four days for an update to malware definitions is how computers get infected with malware. But you’re right that they should at least do some sort of simple test. “Does the machine boot, and are its files not getting overzealously deleted?”

Kissaki ( @Kissaki@programming.dev ) · 5 months ago

One of the fixes was deleting a System32 driver file. Is a Windows driver how they update definitions?

Gestrid ( @Gestrid@lemmy.ca ) · edit-2 5 months ago

The driver was one installed on the computer by the security company. The driver would look for and block threats incoming via the internet or intranet.

The definitions update included a driver update, and most of the computers the software was used on were configured to automatically restart to install the update. Unfortunately, the faulty driver update caused computers to BSOD and enter a boot loop.

Because of the boot loop, the driver could only be removed manually by entering Safe Mode. (That’s the thing you saw about deleting that file.) Then the updated driver, the one they released when they discovered the bug, would ideally be able to be installed normally after exiting Safe Mode.

corsicanguppy ( @corsicanguppy@lemmy.ca ) · 5 months ago

We don’t blame the leopards who ate the guy’s face. We blame the guy who stuck his face near the leopards.

Kissaki ( @Kissaki@programming.dev ) · 5 months ago

But how do you identify a leopard when you don’t know about animals and it’s wearing a shiny mask?

MonkderVierte ( @MonkderVierte@lemmy.ml ) · edit-2 5 months ago

That one, then go up the chain of command.
