In the modern, data-heavy world, information is vital.
Businesses often find that they can increase their offerings to customers by analysing customer behaviours, interactions and trends. However, this often comes as an afterthought when designing a solution, and it quickly becomes apparent that the data can hold much more value than initially expected.
The following questions then arise:
How much of this information flows through our system?
What is the velocity of the information flow?
What is the size of the information?
How do I access customer data?
And most importantly — are we storing this information in a useful way?
One might say in defence, “We were building a platform. We didn't have the time or resources to incorporate everything. That’s normal!” But as you grow you have to adjust, reinvest and ensure that you take advantage of your most precious asset: data. This will also help you lay the groundwork for another important technology — machine learning.
In this article I’ll provide a high-level view of the differences between big data and machine learning, and help you decide which you should be focusing on in your organisation.
Is it information or data?
I’ve talked about information flow within the system, but at the risk of sounding pedantic I should really have called it data. We use data, whether structured or unstructured, to generate information – when we hoard this data, we are simply not getting the full benefit. We need to analyse it, and extrapolate from it. Only then do we have information to work with.
A typical application will generate a large amount of data, from system logs and client data, to session data and usage data. This all becomes useful when we store the data in a meaningful manner and use the right tools to ask the right questions, such as: when are we most busy? When do we experience downtime? Who is using our application? How can we spot trends and make recommendations based on them to increase revenue?
This leads to the first step of our journey on comparing big data to machine learning.
What is big data?
As mentioned above, we are creating and consuming ever-increasing amounts of data in modern applications. We care about the amount of data captured because we have to store and process it, and big data is a generic term for this process.
If we know what kind of information we are hoping to extract from the data, we can choose to store it in a structured format that we can query. Databases have been part of computing for decades, and there are many resources available that explain how to store data in a relational format, which can then be queried with SQL from most programming languages.
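As a minimal sketch of what storing data in a structured, queryable format can look like, the example below uses Python's built-in sqlite3 module with an in-memory database. The table name, column names and sales records are all hypothetical and invented purely for illustration.

```python
import sqlite3

# Hypothetical sales records: (product, units, sale_date). Invented for illustration.
SALES = [
    ("baseball cap", 3, "2024-05-02"),
    ("t-shirt", 5, "2024-05-10"),
    ("baseball cap", 4, "2024-05-21"),
    ("mug", 2, "2024-06-01"),
]


def build_db(rows):
    """Create an in-memory database with one structured table of sales."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, units INTEGER, sale_date TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return conn


def units_per_product(conn, start, end):
    """Total units per product for start <= sale_date < end, best seller first."""
    query = """
        SELECT product, SUM(units) AS total
        FROM sales
        WHERE sale_date >= ? AND sale_date < ?
        GROUP BY product
        ORDER BY total DESC, product ASC
    """
    return conn.execute(query, (start, end)).fetchall()


conn = build_db(SALES)
# ISO-formatted dates sort correctly as plain text, so string comparison is enough here.
may_totals = units_per_product(conn, "2024-05-01", "2024-06-01")
print(may_totals)
```

Because the data has a known shape, a question such as "what sold the most units in May?" becomes a single query. A production system would more likely use a dedicated database server or data warehouse, but the principle is the same.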
There are also cases where we can’t store data in a structured way because it doesn’t fit a simple type such as a string or number. Media such as photos, video or audio tend to be stored as plain files (or chunks of data that only a computer can interpret), and we don’t have a general method for querying this data for information since it isn’t in a human-readable format. This becomes our unstructured data.
What is machine learning?
Ever the headline grabber, machine learning is a tool that allows us to make sense of both structured and unstructured data. With machine learning, we are training a computer model on a very large dataset, with the hope that the model will gain an insight into what information is contained within.
In some instances, we can guide the training using supervised learning; that is, we can label data and explain to the model what the data represents. We might give structured data a higher score when sales are high; we might tag unstructured data such as images with their subject matter. This allows the model to infer what new, unlabelled data represents.
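To make the idea of labelled data concrete, here is a small sketch of supervised learning in plain Python. It uses a nearest-centroid classifier, which is one of the simplest possible techniques and stands in for whatever model you might actually choose. The features (weekly visits, basket size), the labels and every number are assumptions made up for this example.

```python
from math import dist
from statistics import mean

# Hypothetical labelled training data: (weekly visits, basket size) -> sales label.
TRAINING = [
    ((120.0, 1.0), "low"),
    ((150.0, 2.0), "low"),
    ((900.0, 6.0), "high"),
    ((1100.0, 8.0), "high"),
]


def train(examples):
    """Compute one centroid (mean feature vector) per label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    # Note: features are not scaled here, so the larger-valued feature dominates
    # the distance. Real pipelines would normally normalise features first.
    return min(centroids, key=lambda label: dist(centroids[label], features))


model = train(TRAINING)
prediction = predict(model, (1000.0, 7.0))
print(prediction)
```

The labels are what make this supervised: the model never sees a rule for "high" or "low", only examples that a person has already tagged.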
Dependencies between big data and machine learning
After reading the above, you might have already drawn a relationship between big data and machine learning.
Big data is concerned with the collection, storage and analysis of data, both structured and unstructured. Without it, we have nothing to train our machine learning models with.
This is a valid point, and highlights that there is indeed a dependency between the two. It also exposes a hierarchy of sorts: without a clear implementation of big data you are not ready to jump straight into machine learning.
This makes sense. Without source data we can’t expect a machine learning model to learn and expose inferences to us.
"If you wish to make an apple pie from scratch, you must first invent the universe" - Carl Sagan, Cosmos
Using the right tool for the job
Now that we’ve covered the main differences between big data and machine learning, we need to take a small step back and ask ourselves, “What are we actually trying to achieve?”
This is key to understanding whether we need big data or machine learning. A basic approach I like to take is to group my questions into ‘whats’ and ‘whys’.
What sold the most units last month?
What was the average site traffic between two dates?
What was our average storage spend over summer? Over winter?
What products were purchased in multiple shopping trolleys?
I generally find that a ‘what’ is easy for a skilled user to answer with big data tooling and familiar investigation patterns. If you have ‘what’ questions, they are often a good use case for big data.
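As a rough illustration of how directly a ‘what’ question maps onto a query, the sketch below answers "what products were purchased in multiple shopping trolleys?" using only Python's standard library. The trolley contents are hypothetical, and in practice this would more likely be a query against your data store than a script.

```python
from collections import Counter

# Hypothetical shopping trolleys, each a list of product names.
TROLLEYS = [
    ["baseball cap", "sunscreen", "mug"],
    ["baseball cap", "t-shirt"],
    ["mug", "baseball cap", "baseball cap"],
    ["notebook"],
]


def trolley_counts(trolleys):
    """Count how many distinct trolleys each product appears in."""
    counts = Counter()
    for trolley in trolleys:
        counts.update(set(trolley))  # set() so duplicates within a trolley count once
    return counts


def in_multiple_trolleys(trolleys):
    """Products found in more than one trolley, most widespread first."""
    counts = trolley_counts(trolleys)
    repeated = [(product, n) for product, n in counts.items() if n > 1]
    return sorted(repeated, key=lambda item: (-item[1], item[0]))


result = in_multiple_trolleys(TROLLEYS)
print(result)
```

The question is fully specified up front, so the answer is a matter of counting. No model needs to be trained.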
Why was there an increase in baseball caps being purchased last month?
Why did more visitors arrive at the site, and which marketing changes drove them?
Why did our storage costs change when we didn’t change anything?
Why do people like to purchase groups of unrelated items? How can we capitalise on this?
A ‘why’ takes guesswork and prodding at the data; this is difficult when you don’t know what you are looking for. If you have a ‘why’, it might be more useful to deploy machine learning.
This is an oversimplification, but it is a useful way to explain the different approaches.
What do I want to achieve?
I always recommend that you start simple for maximum velocity. Get your data cleanly stored and start creating information out of it.
Do you want information about your system? You have to begin by poking the data.
Only when you start running into the ‘why’ questions that you are struggling to answer should you look towards pushing that information into training a machine learning model — at that point you will have already given a lot of meaning to the information and will be ready to take your next step of the journey.
So… big data or machine learning?
When it comes to making this decision, there is a ninety-ninety rule:
"The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time." - Tom Cargill, Bell Labs
This relates to the big data issue because you begin to get results from that first 90% of code written. If you spend that time on relational models, queries and analytics you can begin to learn (and profit) early on.
The remaining 10% (the unknown) should then be fed to machine learning once you have a useful platform in place.
So, what’s the answer to the question of ‘big data or machine learning’?
Both, in time.
Like any piece of work, big data and machine learning should be a transition: a linear movement that allows you to embrace information and extract value, learning and adapting as you go.
If you’d like to learn more about how to get the most value out of your data, visit our Data and AI page for more information.