Journey to Machine Learning { Lesson 2 => Where is the data? }


Wow, it’s been a long hiatus. Apologies for being away from you all for so long. Coming back late to the party means we’ve put a lot together to give you the right information. So yeah, welcome back; we promise not to go quiet for that long again. Before I dive into today’s topic, I’d like to give a big shout-out to Shola, one of our readers who reached out to me. Here’s a snapshot of our conversation. This means a lot to us; it keeps us on our toes to deliver great write-ups. Thank you so much.

[Screenshot of the conversation with Shola]

People, please keep the nice comments coming :)

On today’s Tech Thursday... If you haven’t been following my journey to Machine Learning, now is a good time to catch up, over here.

So this week’s post has great answers to some of the major questions asked by newbies diving into Data Science. I must say I find the course interesting, thanks to the speaker => Murtaza Haider. He never bores me, not for a moment. Let’s walk through the answers to the questions.

Is there free data to get started in Data Science?

If you’re just starting to learn Data Science, the major question is usually “Is there free data I can get my hands on?” There is, except that it often comes with a few restrictions and you’ve got to keep an open mind about it. There are several websites you can go to, like The World Bank. For Nigeria, I’m not sure where to look for now. I really wish I had an answer to how we can pull publicly available data in Nigeria, apart from websites like Twitter, Facebook, etc.

What should you consider when working with data from different sources?

When you’re retrieving data from the internet, it comes in a wide variety of formats, and you also have to check it for quality and integrity. Be sure to avoid using dubious data. If you do have to work with raw data whose authenticity is unclear and there’s nobody to vouch for the data set, the best thing to do is clean up the raw data while making sure you understand the missing values. Also, structure it in a way that you’re sure makes it ready for full analysis.
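To make the missing-values point a bit more concrete, here is a minimal sketch in Python (my own choice of language, the course doesn’t prescribe one). It counts the missing cells in each column of a small CSV. The sample data, the `MISSING` markers and the `count_missing` helper are all made up for illustration; a real data set will have its own way of marking gaps.

```python
import csv
import io

# Made-up raw data for illustration only.
# An empty cell or "NA" stands for a missing value here (an assumption).
RAW = """region,year,value
A,2014,10
B,2014,
C,NA,30
"""

# Strings treated as "missing"; adjust to whatever your source uses.
MISSING = {"", "NA", "N/A", "null"}


def count_missing(text):
    """Return a dict mapping each column name to its number of missing cells."""
    reader = csv.DictReader(io.StringIO(text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in counts:
            value = row.get(name)
            # A short row yields None; otherwise compare the trimmed text.
            if value is None or value.strip() in MISSING:
                counts[name] += 1
    return counts


missing_counts = count_missing(RAW)
print(missing_counts)
```

Knowing which columns have gaps, and how many, is the first step before deciding whether to drop those rows, fill them in, or go back to the source.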

What is closed data?

A large share of all the data out there is closed data. It is mostly data held by governments and kept within the organization, because a lot of it contains private information. In this case, the data scientists who work for the organization are the ones who can access it. However, some organizations make their internal data accessible to the public so that external data scientists can pull it and perhaps produce better analytics than the organization has been able to achieve.

What is Big data?


Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information.

Do you need Big Data to Learn about Big Data?

The answer to this is YES and NO. :) Why, right?

So let’s create an analogy. Let’s say Nana is a pretty damsel who loves to work out to keep fit. Her resolution for the year 2016 is to be able to lift a 1,000kg barbell. She’s left with two options:

  • Start straight away with 1,000kg.
  • Start with 50kg and move upward until her target is reached (the smarter way).

Both options are ways she can hit her target, but I bet most people will go with the second option. The same analogy applies to whether you need big data to learn big data. If you’re new to Data Science like me, it’s wise to start with smaller data sets. The major focus should not be on the volume of data but on analyzing the data, manipulating it, subjecting it to a variety of algorithms, generating reports and writing stories from it. These skills have to be developed first before you can confidently move on to the bigger challenges of big data.

What are the pitfalls of Big Data?

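  • When you have very large datasets, almost any pair of variables can show up as correlated, even when the relationship is negligible or down to chance, and this can lead to false results.
  • Relying solely on data sets while ignoring other factors. An example is Google Flu Trends, whose search-based estimates ended up overestimating flu activity.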
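The first pitfall is easy to see for yourself. Below is a rough sketch, assuming nothing more than Python’s standard library: it generates columns of pure random noise and reports the strongest correlation found between any two of them. The function names and the sizes (5 versus 200 variables, 20 observations) are my own picks for illustration. It only shows the “many variables” side of the problem, not every way big data can mislead.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_vars, n_obs, seed=0):
    """Largest |correlation| between any two columns of unrelated random noise."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    best = 0.0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            best = max(best, abs(pearson(columns[i], columns[j])))
    return best


# Same number of observations, but far more variables to compare.
few = strongest_chance_correlation(n_vars=5, n_obs=20)
many = strongest_chance_correlation(n_vars=200, n_obs=20)
print("strongest correlation among 5 noise variables:", round(few, 2))
print("strongest correlation among 200 noise variables:", round(many, 2))
```

None of these columns has anything to do with any other, yet the more of them you compare, the stronger the best “relationship” you will find. That is why a correlation pulled out of a huge data set needs a sanity check before it goes into a report.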

Sources of Data

For examples of sites that provide data for Data Science, you might want to take a look at the following:

  1. Quandl is a marketplace for financial and economic data delivered in modern formats for today’s analysts.
  2. Google is also a source of data: Google Trends, Google Correlate.

Yes! And this brings us to the end of today’s journey into Machine Learning. It’s getting more interesting, right? No heavy coding yet, but I have a few beautiful quotes about taking baby steps as a post giveaway :)

The elevator to success is out of order. You’ll have to use the stairs… one step at a time – Joe Girard.

So no matter how far my goal is, no matter how small a step I’m taking, as long as I keep moving forward, someday I’ll get there! – Eiichirou Maruo

One may walk over the highest mountain one step at a time – John Wanamaker

Subscribe to stay updated whenever a new post goes live. See you next Tech Thursday.


