The story behind the theme song 'Danger Zone' landing on the Top Gun soundtrack is an interesting one. Movie and music producers tested dozens of song options before having a track specially written. After offering it to several artists, they landed on Kenny Loggins to deliver the iconic track.
Knowing what to include in an analysis is something a data scientist probably thinks about daily. After all, rushing in without a clear goal or fishing for random data will land you a poor outcome. That’s when you cross over into the danger zone – exactly what we're looking at today.
In this article, we'll talk about what the data science danger zone is, and what you might be doing to end up there. We recently presented this at the Seeq Summit in Perth and discovered that more than a few people in the audience had done just that and run into their own ‘Maverick’ and ‘Iceman’ situation.
Apologies in advance if we get this iconic song stuck in your head! Let's do this.
What is the data science danger zone?
We pretty much all spend time on the internet, and the majority of us are using AI tools to make life a little bit easier.
In your internet travels, perhaps you’ve come across the term ‘vibe coding’.
You can thank OpenAI co-founder Andrej Karpathy for coining the phrase when he said:
“There’s a new kind of coding I call ‘vibe coding’, where you fully give in to the vibes, embrace exponentials and forget that code even exists... I’m building a project or web app but it’s not really coding – I just see stuff, say stuff, run stuff, and copy-paste stuff, and it mostly works.”
Vibe coding uses tools to generate code from prompts, then test and refine it – all via LLMs. And this has developed further with AI coding assistants such as GitHub Copilot.
This style of coding can be incredibly helpful for hobby projects, but in regulated, high-stakes environments, vibe-based decisions can lead to critical oversights that have real-world implications.
The long and the short of ‘vibe coding’ is that yes, these tools make coding more accessible, but that also means the barrier to entry for getting answers is now much lower.
One place this can be dangerous is in data science...
The Data Science Venn Diagram, as laid out by Drew Conway.
“Don’t think, just do.” - Maverick
Lots of tools on the market have empowered a new wave of what we call ‘citizen data scientists’.
Usually, these are subject matter experts without a formal data science background. They want to dive into the data and use analysis tools to gain insights, but they may not have training in engineering or visualising the data. This brings with it challenges around data quality.
It’s like Maverick pulling off an incredible aerial manoeuvre. He lands it perfectly, but when the debrief comes and there's a need for context, all he can say is, “I just felt it.”
You may find an outcome, but the path to get to it isn’t easily replicated, and it’s not easy to explain.
Process engineers, operators or planners already know the types of problems that might exist in their business, and now they have the tools to explore their data visually and in real time. They even understand what to look for, so they have everything they need – but an accessible tool doesn’t replace the need for data literacy.
There are traps to watch out for, and that’s where the danger zone begins.
Keep reading – next we'll cover some of the reasons why you might be entering the data science ‘danger zone’.
It could be because of:
- Data dredging
- Inadequate validation, or
- Building great insights with no path to operationalisation.
Data dredging
A quick glance at the chart below might have you drawing conclusions about UFOs, what they use for fuel, and where they fill up.
What you’re looking at in the chart is an unexpected correlation between Google searches for the phrase 'report UFO sighting' and the volume of kerosene consumed in South Korea – mapped out on Tyler Vigen’s Spurious Correlations site.
“Flies by the seat of his pants, totally unpredictable.” - Jester
This unexpected correlation is a wild one, but it’s exactly what happens when data dredging is at play.
Maybe your company has invested in the infrastructure and has the data. While ‘fishing’ through it, someone spots a correlation. You’re in the danger zone when you start trying to find a hypothesis that fits the correlation.
You shouldn’t start by searching for answers in the data. Come up with a hypothesis first and then find data that either supports that hypothesis or rejects it.
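To make the trap concrete, here’s a minimal Python sketch – the counts, thresholds and data are invented purely for illustration – showing why fishing through enough unrelated series will almost always turn up a correlation that looks meaningful:

```python
# A minimal sketch of why data dredging "finds" patterns: screen enough
# unrelated series against a target and a few will correlate by chance.
import numpy as np

rng = np.random.default_rng(seed=0)
n_weeks = 52          # one year of weekly observations (illustrative)
n_candidates = 500    # hundreds of unrelated tags/searches/metrics (illustrative)

target = rng.normal(size=n_weeks)                      # the metric we want to "explain"
candidates = rng.normal(size=(n_candidates, n_weeks))  # pure noise, no real relationship

# Pearson correlation of each candidate series with the target
r = np.array([np.corrcoef(c, target)[0, 1] for c in candidates])

print(f"strongest |r| found by fishing: {np.abs(r).max():.2f}")
print(f"candidates with |r| > 0.3: {(np.abs(r) > 0.3).sum()}")
# With 500 tries you will almost always "discover" correlations that look
# meaningful, even though every series here is random noise.
```

None of the ‘discovered’ correlations mean anything, which is exactly why the hypothesis has to come first.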
Takeaway:
- You can bring data together, but that doesn’t mean the correlations are correct.
- Start by forming a hypothesis and select a small set of datapoints to test and validate it.
Inadequate validation
Now you’ve come up with a hypothesis, and you’re building a model to validate it and see how well it forecasts.
Hey, this looks pretty good!
But if we zoom out, we haven’t understood the seasonality of our data.
The problem here is that looking at a small scale shows one thing, but zooming out gives us more context and reveals a pattern we couldn’t see in the data before.
"No, no, no. There's two Os in 'Goose,' boys." — Goose
Viewing only the small scale – or moving ahead without validating the data – can lead to misguided decisions, wasted time and resources, and other negative business outcomes.
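To illustrate the point, here’s a minimal Python sketch on synthetic data – the yearly cycle, window size and noise level are all invented for illustration. A simple trend fitted to a 30-day window looks accurate on that window, then falls apart once the full year’s seasonality comes into view:

```python
# A minimal sketch of how a model can look great on a narrow window and
# fail once seasonality appears. Data and model are illustrative only.
import numpy as np

rng = np.random.default_rng(seed=1)
days = np.arange(365)
# Synthetic daily signal with a yearly cycle plus noise
demand = 100 + 20 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 2, size=365)

# "Validate" on one month only: fit a straight line to the first 30 days
train = days < 30
coeffs = np.polyfit(days[train], demand[train], deg=1)
pred = np.polyval(coeffs, days)

rmse_month = np.sqrt(np.mean((pred[train] - demand[train]) ** 2))
rmse_year = np.sqrt(np.mean((pred[~train] - demand[~train]) ** 2))

print(f"RMSE on the 30-day window: {rmse_month:.1f}")      # looks healthy
print(f"RMSE over the rest of the year: {rmse_year:.1f}")  # much worse
# The short window can't see the yearly cycle, so the trend it learns
# extrapolates badly: exactly the "zoomed in" problem described above.
```

The exact numbers will vary, but the gap between the in-window error and the full-year error is the warning sign that validation was done on too narrow a slice.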
Make sure to validate the correlations you’ve found with people outside your skillset. Talk to your process engineer. Talk to your data person. Talk to your operator.
Takeaway:
- The best insights come from collaboration. Validate findings with people outside of your skillset.
Insights without operationalisation
Once your business has the model... then what?
Because to be successful, a model or insight needs to change a user's behaviour.
"It takes a lot more than just fancy flying."- Charlie
Every problem should start with an understanding of what behaviour you are going to change, because the answer to that question informs what you build.
Think about:
- Who is using it?
- What decisions are they making?
- When do they make the decision?
There’s no point building a brilliant model if you have no clear plan to put it into action.
It’s important that you iterate with end users in mind, because change management is also a huge part of the implementation.
If you follow an insight and make a behaviour change, you're taking on some risk – what if the insight is wrong? You need to know whether the business is willing to accept that risk.
How to avoid the danger zone
“You’re everyone’s problem. That’s because every time you go up in the air, you’re unsafe.” - Iceman
No one wants to be Maverick in this scenario.
Here are a few ways to know when you’re in the danger zone:
- If you're acting on trends without understanding the cause… you're in it.
- If you can’t explain the insight to someone else in the business… you’re probably in it.
- If there isn’t a user or customer in the business who needs the insight… you’re probably in it.
The way to avoid many of these pitfalls is by starting any data modelling project with a clear objective in mind. Otherwise, it’s all too easy to waste time chasing meaningless patterns (like UFOs running on kerosene) and finding unreliable results.
“Talk to me, Goose!” - Maverick
It sounds easy! But what we know from speaking with clients, which was further validated at Seeq Summit, is that it can be quite a complex process. Sometimes you need to bring in the big guns to save yourself days of frustration or risky decisions. Nukon can bring together your cross-functional teams to help you generate the right outcomes, so that you can make better, data-driven decisions for your business.