Good Data vs Big Data: Which Is More Important?
An overview of good data & big data and a discussion on which of them should be prioritised...

I have experience and a background in electronics. I started out building embedded systems and my passion for robotics made me branch out into artificial intelligence.
Now I'm fully into machine learning and computer vision, and I love getting my hands dirty with cool and interesting projects. In the future, I plan on merging my knowledge of electronics and AI into the field of robotics.
I studied Computer Engineering at the Obafemi Awolowo University, Ile-Ife, Nigeria. When I'm not coding or doing chores, you would find me playing video games or enjoying some rock music.
In this article, we'll mostly be taking a break from math equations and lines of code to discuss a topic that comes up fairly often in the data industry. For an overview of what we'll be discussing, take a look at the table of contents at the beginning of this article. 👆
What is Data?
Before we delve into anything, we should first know what we are even talking about. At its most basic, data is raw information, as I learned when I was still a kid. More precisely, according to a post on GeeksforGeeks, data is any unprocessed fact, value, text, sound or image that is yet to be interpreted or analysed.
In machine learning, data is very important and trying to make sense of it is the whole essence of what we do, which means we need it big and we need it good. This leads to the main terms of interest in this article.
What is Big Data and Good Data?
sas.com defines big data as large, hard-to-manage data which can be structured or unstructured. It is so large or complex that it is difficult to process with traditional methods. Big data can give you better insight because it captures a lot of fine-grained detail.
That being said, we should consider what we mean by good data. Good data refers to data that is highly accurate, complete, conformant to storage standards, consistent, unique and valid. It is often referred to as clean data (I shall be using both "good" and "clean" interchangeably in this article).
And now the big question:
Which is More Important?
Now that we know what big data is and what good data is, which should be considered a priority? I wish I could just give a straight answer but it's not that simple. I will break it down into 3 parts.
Part 1: It depends on the application
It really does depend on what you would like to use your data to do. For example, if you just wanted to analyse the data to develop some insight, you might consider big data a priority rather than clean data. Someone else who uses the same data to build machine learning models might consider clean data a priority, as feeding very bad data into your model during training can give undesirable results.
Even in the case of building machine learning models, it could still depend on the kind of algorithm you're using. Classical machine learning algorithms like linear regression, logistic regression, or decision trees tend to gain little from ever larger datasets, and training them on very large volumes of data can become slow. In this case, an ML engineer or data scientist might decide to prioritize clean data. Deep learning algorithms, on the other hand, generally need huge volumes of data to perform well, so an ML engineer or data scientist might prioritize big data in this case.
Part 2: Ideally, none is more important
If we look at the problem from the ideal perspective, both of these data properties (good and big) should be balanced. Big data that is relatively bad would give inaccurate results and, in the same vein, good data that is simply too small will not give the expected result. So the best thing is to work towards getting data that is big in terms of volume, variety and velocity:
- Volume: Large amount of data.
- Variety: Data of different types and from different sources (structured, unstructured, text, images and so on).
- Velocity: Data is created at a rapid rate.
And not just big, but also good in terms of:
- Accuracy
- Completeness
- Conformance to storage standards
- Consistency
- Uniqueness
- Validity
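Some of the qualities listed above can be checked mechanically. Here is a minimal sketch in plain Python, and not a definitive implementation: the records, the field names, the 0 to 120 age range and the uppercase country code rule are all assumptions I made up for illustration. Accuracy is left out on purpose, since checking it needs a trusted reference to compare against.

```python
# Hypothetical records; field names and rules below are assumptions.
records = [
    {"id": 1, "age": 34, "country": "NG"},
    {"id": 2, "age": None, "country": "NG"},  # incomplete: missing age
    {"id": 3, "age": 212, "country": "GH"},   # invalid: age out of range
    {"id": 3, "age": 212, "country": "GH"},   # exact duplicate of the row above
    {"id": 4, "age": 27, "country": "ng"},    # non-conformant: lowercase code
]


def quality_report(rows):
    """Return the fraction of rows that pass each simple check."""
    total = len(rows)
    # Completeness: no field is missing.
    complete = sum(all(v is not None for v in r.values()) for r in rows)
    # Validity: age is present and falls in an assumed plausible range.
    valid = sum(r["age"] is not None and 0 <= r["age"] <= 120 for r in rows)
    # Conformance: country codes follow the assumed uppercase format.
    conformant = sum(r["country"].isupper() for r in rows)
    # Uniqueness: number of distinct rows relative to the total.
    distinct = len({tuple(sorted(r.items())) for r in rows})
    return {
        "completeness": complete / total,
        "validity": valid / total,
        "conformance": conformant / total,
        "uniqueness": distinct / total,
    }


report = quality_report(records)
for name, score in report.items():
    print(f"{name}: {score:.0%}")
```

A score well below 100% on any of these is a hint that the data needs cleaning before it is used for anything serious.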
Part 3: Gun to my head, pick one?
Say a gun was pointed at my head and the man behind the gun asked me to pick one to prioritize despite all I've said. Well, in that case, with a shaky voice and intense fear, I would say good data should be prioritized, simply because a normal-sized dataset can still give you powerful insight if the data is sparkling and squeaky clean. Big data that is not clean would simply be big for nothing, as the results of using it would be unpleasant.
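To make that choice a bit more concrete, here is a toy simulation, offered as a sketch under stated assumptions rather than proof. It estimates an average from a small clean sample and from a much bigger sample in which every value carries the same systematic error. The true mean, noise level, bias and sample sizes are all made-up numbers. Note that this only covers systematic errors; purely random noise tends to average out as the data grows, so it would tell a different story.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

TRUE_MEAN = 50.0  # the quantity we want to estimate (assumed)
NOISE = 1.0       # random measurement noise (assumed)
BIAS = 5.0        # systematic error in the dirty data (assumed)

# Small but clean: 100 accurate measurements.
small_clean = [random.gauss(TRUE_MEAN, NOISE) for _ in range(100)]

# Big but dirty: 100,000 measurements that all carry the same bias.
big_dirty = [random.gauss(TRUE_MEAN + BIAS, NOISE) for _ in range(100_000)]

# How far each estimate lands from the true value.
clean_error = abs(statistics.fmean(small_clean) - TRUE_MEAN)
dirty_error = abs(statistics.fmean(big_dirty) - TRUE_MEAN)

print(f"small clean sample error: {clean_error:.3f}")
print(f"big dirty sample error:   {dirty_error:.3f}")
```

In this setup, piling on more dirty data does not help: the estimate just settles more firmly on the wrong answer.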
Conclusion
This is why I feel analysis of your data is pretty important if you're going to be using the data to train a machine learning model. This will ensure the data is good enough to deliver the best result.
This is a very tricky topic though, so let me know what you think in the comments. If the gun was pointed at your own head, what would you pick and why?
Thank you for taking the time to read. You deserve a slice of pie or whatever these kids are eating.




