Good Data Vs Big Data, Which is More Important?

In this article, we'll be taking a break from math equations and lines of code to discuss a topic that comes up fairly often in the data industry. To have an overview of what we'll be discussing, do take a look at the table of contents at the beginning of this article. 👆

What is Data?

Before we delve into anything, we should first know what we are talking about. At its most basic, data is raw information, as I learned when I was still a kid. According to a post on GeeksforGeeks, data is any unprocessed fact, value, text, sound or image that is to be interpreted or analysed.

In machine learning, data is very important and trying to make sense of it is the whole essence of what we do, which means we need it big and we need it good. This leads to the main terms of interest in this article.

What is Big and Good Data?

sas.com defines big data as large, hard-to-manage data that can be structured or unstructured. It is so large or complex that it is difficult to process with traditional methods. Big data can give you better insight because it captures a lot of fine-grained information.

That being said, we should consider what we mean by good data. Good data is data that is accurate, complete, conformant to storage standards, consistent, unique and valid. It is often referred to as clean data (I will use "good" and "clean" interchangeably in this article).

And now the big question:

Which is More Important?

Now that we know what big data is and what good data is, which should be the priority? I wish I could give a straight answer, but it's not that simple. I will break it down into three parts.

Part 1: It depends on the application

It really does depend on what you would like to do with your data. For example, if you just want to analyse the data to develop some insight, you might prioritize big data over clean data. Someone who uses the data to build machine learning models might instead prioritize clean data, since feeding very bad data into a model during training can give undesirable results.

Even when building machine learning models, it can still depend on the kind of algorithm you're using. Classical machine learning algorithms like linear regression, logistic regression, or decision trees can be slow or memory-hungry to train on very large datasets with standard single-machine implementations, and they often gain little from the extra volume. In this case, an ML engineer or data scientist might decide to prioritize clean data. Deep learning algorithms, on the other hand, usually need huge volumes of data to perform well, so an ML engineer or data scientist might prioritize big data in that case.

Part 2: Ideally, neither is more important

If we look at the problem from an ideal perspective, both of these data properties (good and big) should be balanced. Big data that is relatively bad gives inaccurate results and, in the same vein, good data that is too small will not give the expected result. So the best thing is to work towards data that is big in terms of volume, variety and velocity:

Volume: A large amount of data.
Variety: Different types and formats of data, drawn from different sources.
Velocity: Data that is created at a rapid rate.

And not just big, but big data that is also good in terms of:

Accuracy
Completeness
Conformance to storage standards
Consistency
Uniqueness
Validity
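
To make a few of these properties concrete, here is a minimal Python sketch that scores a tiny dataset on completeness, uniqueness and validity. The records, the field names and the "age between 0 and 120" rule are all made up for illustration, and these are simple ratios rather than any standard definition of the metrics.

```python
# Hypothetical records; the fields and values are assumptions for illustration.
records = [
    {"id": 1, "age": 34, "email": "ada@example.com"},
    {"id": 2, "age": None, "email": "bob@example.com"},   # missing age
    {"id": 2, "age": None, "email": "bob@example.com"},   # exact duplicate
    {"id": 3, "age": -5, "email": "carol@example.com"},   # invalid age
]


def completeness(rows):
    """Fraction of cells that are not missing (None)."""
    cells = [value for row in rows for value in row.values()]
    return sum(value is not None for value in cells) / len(cells)


def uniqueness(rows):
    """Fraction of rows that are distinct from one another."""
    distinct = {tuple(sorted(row.items())) for row in rows}
    return len(distinct) / len(rows)


def validity(rows, field, is_valid):
    """Fraction of non-missing values in `field` that pass `is_valid`."""
    present = [row[field] for row in rows if row[field] is not None]
    if not present:
        return 1.0
    return sum(is_valid(value) for value in present) / len(present)


report = {
    "completeness": completeness(records),
    "uniqueness": uniqueness(records),
    # Assumed business rule: a plausible age lies between 0 and 120.
    "age_validity": validity(records, "age", lambda age: 0 <= age <= 120),
}

for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Scores close to 1.0 suggest the data is in good shape on that property. In a real project you would probably run similar checks with a library such as pandas, but the idea is the same.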

Part 3: Gun to my head, pick one?

Say a gun was pointed at my head and the man behind the gun asked me to pick one to prioritize despite all I've said. Well, in that case, with a shaky voice and intense fear, I would say good data is the better priority, simply because a normal-sized dataset can still give you powerful insight if it is sparkling and squeaky clean. Big data that is not clean would simply be big for nothing, as the results of using it would be unpleasant.
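
Here is a toy simulation of that intuition, not a proof. It assumes we want to estimate an average whose true value is 50, and that in the big dataset roughly 20% of readings were wrongly stored as 0. The numbers, the corruption rate and the kind of error are all assumptions chosen for illustration.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

TRUE_MEAN = 50.0


def draw(n):
    """Draw n clean readings around the true mean."""
    return [random.gauss(TRUE_MEAN, 10.0) for _ in range(n)]


# Small but clean: 200 correct readings.
small_clean = draw(200)

# Big but dirty: 20,000 readings, where about 20% were stored as 0
# (an assumed data-entry error, e.g. a missing value coded as zero).
big_dirty = [0.0 if random.random() < 0.2 else x for x in draw(20_000)]

clean_error = abs(mean(small_clean) - TRUE_MEAN)
dirty_error = abs(mean(big_dirty) - TRUE_MEAN)

print(f"small clean data: off by {clean_error:.2f}")
print(f"big dirty data:   off by {dirty_error:.2f}")
```

In this setup the dirty dataset is 100 times larger, yet its estimate lands much further from the truth, because collecting more rows does nothing to remove a systematic error. Different kinds of noise would behave differently, so treat this as one illustration only.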

Conclusion

This is why I feel analysing your data is pretty important if you're going to use it to train a machine learning model. It helps ensure the data is good enough to deliver the best result.

This is a very tricky topic though, so let me know what you think in the comments. If the gun were pointed at your own head, what would you pick and why?

Thank you for taking the time to read. You deserve a slice of pie or whatever these kids are eating.

[Image: kids eating]
