Demystifying data science, Part 1: Data Science 101

Mayank Sharma – applied mathematician, researcher and engineer rolled into one – reluctantly accepts the simpler title of data scientist. He recently left IBM after 13 years to join the wealth management firm Raymond James as their Head of Data Science.

This is the first of a 3-part blog series called “Demystifying Data Science”, where Mayank takes a practical, business-centred approach. We start with an introductory “Data Science 101” before delving further into real-life applications and common pitfalls to avoid.


What is Data Science?

There are a lot of terms floating around in the media and the business world: machine learning (ML), artificial intelligence (AI), data science, analytics, statistics… These expressions are often used interchangeably, and they overlap – they’re all about trying to derive insights from data. Which term gets used often has more to do with whichever buzzword business leaders favour at the time.

In principle, data science is an ensemble of disciplines. It doesn’t matter which discipline you draw on – the idea is that you have some data and you want to interrogate it scientifically. Perhaps you want to find its relevance, how much meaningful information exists within the noise, or whether it can actually be used to bring about some end result. It could and should include every discipline that one deems relevant, such as:

  • Statistics/ML: Finding correlations, trends, patterns, clusters and anomalies; testing hypotheses and models; providing explanations for insights drawn
  • Survey design: If you’re working with people data, designing good data collection and field testing strategies is key
  • Data Engineering: How do you actually source the relevant data? How should you ETL (Extract, Transform, Load) your data? Should you ETL or is it better to ELT? It’s no use having data sitting in silos if it can’t be managed, combined and used by the people who need it
  • Algorithms: Ranging all the way from simple queries to specific methods from supervised and unsupervised learning or reinforcement learning (RL), drawing on building blocks such as natural language processing (NLP) or neural networks
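
To make the ETL versus ELT question a little more concrete, here is a minimal sketch in Python using the standard library's sqlite3 module. The records, table names and cleaning rule are all hypothetical, invented purely for illustration. The only point is the ordering: in ETL the data is cleaned before it reaches the database, while in ELT the raw data is loaded first and cleaned inside the database.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
RAW_ORDERS = [
    ("alice", " 120.50 "),
    ("bob", "80"),
    ("alice", "n/a"),  # unusable amount
    ("carol", "42.25"),
]

TOTALS_QUERY = "SELECT customer, SUM(amount) FROM {} GROUP BY customer ORDER BY customer"


def etl(conn):
    """ETL: transform in application code, then load only the clean rows."""
    cleaned = []
    for customer, text in RAW_ORDERS:
        try:
            cleaned.append((customer, float(text.strip())))
        except ValueError:
            continue  # drop rows whose amount cannot be parsed
    conn.execute("CREATE TABLE orders_etl (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?)", cleaned)
    return conn.execute(TOTALS_QUERY.format("orders_etl")).fetchall()


def elt(conn):
    """ELT: load the raw rows as-is, then transform with SQL in the database."""
    conn.execute("CREATE TABLE orders_raw (customer TEXT, amount_text TEXT)")
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?)", RAW_ORDERS)
    # Assumed cleaning rule: keep only values that start with a digit.
    conn.execute(
        """
        CREATE VIEW orders_clean AS
        SELECT customer, CAST(TRIM(amount_text) AS REAL) AS amount
        FROM orders_raw
        WHERE TRIM(amount_text) GLOB '[0-9]*'
        """
    )
    return conn.execute(TOTALS_QUERY.format("orders_clean")).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print("ETL totals:", etl(connection))
    print("ELT totals:", elt(connection))
```

Both routes give the same totals on this toy data. The practical difference is that ELT keeps the raw rows around, so the cleaning rule can be revisited later, whereas ETL discards whatever did not pass the rule at load time.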

Some domains are more readily aligned with certain types of queries, but the focus is always on asking “Do we have the right data and do we have enough of it? If we don’t have enough data – how can we use appropriate models on the noise or the uncertainty to extract something meaningful?”

From the perspective of “analytics”

Another way of looking at data science is through an older classification – analytics – which typically falls into three categories:

  • Descriptive: Tell me what the world looks like
  • Predictive: Given what the world looks like, and given some description of how the world changes through time, tell me what it will look like in a week or a month
  • Prescriptive: This is the most interesting one. What should I do, or what levers can I move, to change our predictions and achieve my business goals?
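
As a rough illustration of how the three categories differ, here is a small Python sketch. The spend and revenue figures are made up, and the straight-line model is an assumption chosen for simplicity rather than a recommendation. The same data is described, then used to predict, then used to suggest which lever to move.

```python
from statistics import fmean

# Hypothetical toy data: monthly marketing spend and revenue (in thousands).
SPEND = [10, 20, 30, 40, 50]
REVENUE = [25, 45, 65, 85, 105]


def describe(values):
    """Descriptive: summarise what the world looks like."""
    return {"min": min(values), "max": max(values), "mean": fmean(values)}


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, spend):
    """Predictive: what does the model expect revenue to be at this spend?"""
    slope, intercept = model
    return slope * spend + intercept


def prescribe(model, revenue_target):
    """Prescriptive: which spend level does the model say reaches the target?"""
    slope, intercept = model
    if slope <= 0:
        raise ValueError("model suggests more spend does not raise revenue")
    return (revenue_target - intercept) / slope


if __name__ == "__main__":
    model = fit_line(SPEND, REVENUE)
    print("Descriptive:", describe(REVENUE))
    print("Predictive: revenue at spend 60 =", predict(model, 60))
    print("Prescriptive: spend needed for revenue 145 =", prescribe(model, 145))
```

The prescriptive step here simply inverts the fitted line, so it is only as trustworthy as the model behind it, and it extrapolates beyond the observed spend range. Real prescriptive analytics would also have to account for constraints, costs and uncertainty.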

Predictive analytics became much more popular in the last decade, when people said “give me headlights”: give me dashboards that tell me where my revenue will be next year. But we still have a way to go before our prescriptive analytics are up to par.

The data science boom

Data science has always existed in any field that tries to gain insights empirically from the data it collects.

What we’ve seen in the past several decades is a broader appreciation of data science, which is no longer restricted to the more numerically driven disciplines such as astronomy or medicine. Data can now play a powerful role in informing business leaders’ decisions. It can be transformed from its raw formats into something concise and actionable. In the last decade or so, there has been an exponential increase in the amount of data and in our understanding of how to harness it and extract valuable insights, as well as an increase in computational power. This makes it possible to turn ideas that were previously frozen in the minds of academics, engineers, scientists and applied mathematicians into something that’s actually applicable.

How data science became mainstream

Applications of Data Science are now all around us, and they are growing exponentially.

One factor that’s made it possible for companies of any size to benefit from data science is the democratisation of computational learning models and libraries. Large companies like Google, Facebook and Amazon have demonstrated that there is immense commercial value in looking at behavioural data, purchase data, demographic data… They’re able to assemble all that data and design algorithms that improve their revenue, or improve technology or product adoption. By deploying data platforms, they can collect entirely new data sets that were not previously available.

A wonderful progression I’ve seen is in the work of companies like Google, which actually put data out there in open-source communities, making publicly available some of the very best speech translation routines, image processing routines, etc.

Until recently, people didn’t have platforms that, simply by virtue of being used, collect data on their own usage. An astronomer was limited to peering through telescopes all night and coming up with wonderful insights simply by dint of genius. Astronomy went through a transformation several decades ago, when it became a domain where computers scan the sky constantly and help astronomers zero in on interesting features worthy of human investigation.

Of course, businesses wouldn’t be able to harness the power of data science if they didn’t have people with the skills to implement and deploy it. It has helped tremendously that there is such a groundswell of interest in AI and ML amongst students and experienced professionals alike. The availability of high-quality, affordable (even free in many cases) online courses has lowered the barriers to entry. As we will discuss in another post, this can be both good and bad.

Now that computational models are more publicly available, there’s little we can’t do – at least in theory…

This is part 1 of 3 in a blog series called “Demystifying Data Science”. Part 2, “Data Science: ‘Give me headlights and take the wheel’”, highlights real-life applications of data science and the ways businesses are using it today, while part 3 explores how to avoid common pitfalls. Follow vyn on LinkedIn or Twitter to be the first to read these.
