My Journey Into Data Science

Quite a number of people have asked me about my switch from Chemical Engineering to Data Science. How did I do it? When did I do it? Why did I do it? I felt today (January 6, 2018) was a befitting day to answer these questions as it marks the third year since I enrolled for my first programming course. I hope sharing my story would give some insight into what I did to become a Data Scientist and encourage budding “anythings” everywhere to pursue their passion fiercely.

My first exposure to Data Science was from a book that had nothing to do with Data Science

In March 2014, I stumbled on a book called The Power of Habit: Why We Do What We Do in Life and Business by Charles Duhigg. In a section of the book called The Habits of Organizations, Charles wrote about a large retail chain that used data on what a female customer bought to predict the likelihood that she was pregnant. To put it lightly, I was mind blown and I had to find out more.

I searched everywhere for what this sorcery was called. After a few months and with the help of my friends, I stumbled on something very similar to what I read in The Power of Habit. It was called Business Analytics.

This discovery came at a tipping point for me because, at the time, I was in my final year of college and had just finished an internship with an Oil & Gas company. My experience there made me wary of taking up Chemical Engineering as a career because I felt like it just wasn’t for me. This realization also made me open to new challenges and to pivoting career-wise. Business Analytics seemed to fit right into that.

I created my first Data Science learning path from an answer on Quora

By 2014, I had graduated and begun my National Youth Service Corps (NYSC) year. During my NYSC, I stumbled on Quora from a Twitter recommendation and I loved it.

In case you are wondering, IDEALLY, NYSC is a one-year mandatory program in Nigeria where you are deployed to a state you aren’t affiliated with to serve in some capacity as either a government worker, teacher or anything else really.

On Quora, I found out that Business Analytics had many names and one was Data Science. I also found a very helpful answer which I recommend to this day for anyone looking to start out as a Data Scientist: How can I become a Data Scientist?

This answer helped shape my first ever learning path for Data Science in January 2015 (Forgive my terrible handwriting).

Written January 2015. Other courses on the left side of the page are The Analytics Edge and Google Analytics

 

I completed 15 MOOCs on Data Science within a year

I primarily learnt Data Science through online courses. I never used a book (I tried). All the courses were free (because I didn’t care for a certificate) and where they were not free like Coursera, I got 100% Financial Aid.

I kissed a lot of frogs when it came to online courses so if you are looking for a loose guide on how to get started in Data Science I’ll save you the stress and focus only on the courses that were worthwhile.

1. Learnt Programming

This was the very first thing on my learning path and the scariest of them all. It was scary because I didn’t have a Computer Science background and the only time I was exposed to programming in College, I absolutely hated it. However, this time I felt I had all the time in the world and nothing to lose so I enrolled for Codecademy’s Learn Python course.

The course was so hard and a lot of it did not make sense to me. I could spend as much as two weeks trying to get a while loop to work and I had no idea what file I/O meant but by sheer brute force, I completed the course.

This was the first time I completed an online course after numerous attempts to do so previously. That gave me some confidence to keep on learning.

2. Learnt core Data Science

A lot of people ask me why I choose to use R over Python. It was by sheer coincidence that my first exposure to Data Science was in R from a course called The Analytics Edge from MIT on edX.

The ten-week course uses a case study approach to teach different parts of Data Science from Machine Learning to Visualization to Optimization using R. It was very demanding and very rewarding. The amazing experience I had on this course is what makes me lean a bit more to R than Python. The course gave me a great foundation and I still refer to my notes from 2015 sometimes.

3. Other helpful courses

Another course I loved, which I took towards the end of 2015, was Data Visualization and Communication with Tableau from Duke University on Coursera. It’s a five-week course that gives a great foundation on the use of Tableau. The instructor is amazing and the best I’ve been exposed to so far.

The next on my list would be Managing Big Data with MySQL from Duke University on Coursera. It’s a four-week course with the same amazing instructor as the Tableau course and teaches both MySQL and Teradata.

Others worth mentioning are: Introduction to Big Data with Apache Spark (a four-course series) from UC Berkeley on edX and Excel for Data Analysis and Visualization from Microsoft on edX.

How I started my blog — where the real learning started

If you read a lot of Quora answers or articles on how to become a better Software Engineer/Data Scientist/Designer and the like, you’d see a recurring piece of advice: Do personal projects to deepen your skill set. I had tried to do that a few times in 2015 but I wasn’t able to do anything reasonable because, frankly, I was not ready.

By 2016, I had slowed down on online courses because 90% of the courses had the same content and assumed you’re a beginner so it became a bit repetitive. By this time, I felt I was ready to start doing personal projects using a blog. The writing part was not an issue because I used to write in High School. My issue, however, was around consistency and creativity. Was I creative enough to put together interesting projects and could I do it consistently? You never know until you try, right? And that’s how I started my blog The Art and Science of Data in June 2016. My learning grew exponentially working on the content for my blog.

I wrote my first two posts within a month and then went on a year-long hiatus

My first post was Predicting The English Premier League Standings which I posted in September 2016 and then What Twitter Feels about Network Providers in Nigeria which was posted in October 2016. The amount of positive responses absolutely floored me. I got about 1,500 views and numerous responses on both posts and for the first time, I felt confident in my skills.

This experience taught me that creativity is not some talent that you either have or don’t. Creativity is born of experience and confidence in your skills, because the possibilities of what can be done expand the more you know.

Then I went on a year-long hiatus on my blog. This happened for many reasons.

  1. I had tried to write a blog post in December 2016 that was a hot mess. I cleaned it up later and used it for my Women in Machine Learning and Data Science Workshop called The ABC-XYZ of Data Science.
  2. After that, I had what I’ll call “The Data Scientist’s block”. I literally had no ideas and could not think up anything useful or interesting.
  3. My approach to my blog is a bit different from most data science blogs because mine involves a lot of research and iterations. It also makes my publishing cycle much longer than others.
  4. Work was grueling and adulting was catching up with me so I became a couch potato.

I finally had an idea in June 2017 on billionaires and with the help of my friends, I published A Data Driven Guide to Becoming a Consistent Billionaire in October 2017 (yes, it took me four months to put it together).

Within three days of publishing, it had 30,000 views. It was everywhere. A sizable number of sites plagiarized the post and I didn’t care. My work was good enough to be plagiarized!

My Little Victories So Far

Apart from the 40,000 views I’ve gotten so far on my post A Data Driven Guide to Becoming a Consistent Billionaire, 2017 was an interesting year for me. For the first time, the work I had put in over the past three years was being validated.

  1. I won a United Nations Data Visualization Contest with my Tableau visualization on “Visualizing Malaria: The Killer Disease Killing Africa” which looked something like this.

2. I got invited to speak at Stanford’s Women in Data Science Conference being held in Nigeria, on the exact same topic as this post.

3. I have numerous collaborations lined up for 2018 both in Nigeria and abroad.

4. I facilitated a workshop at The Women in Machine Learning and Data Science in November 2017.

Truthfully, I’m a bit surprised that I got this far. I remember writing in my notepad “Rosebud, you will never be good enough for this” but here I am. I still have a lot of learning to do but I am also grateful for where I am today.

My Advice for You

I’m no expert, nor am I John Maxwell who gives nuggets of self-help advice, but here are a few things that have really helped me.

  1. Don’t be afraid to let go of something that’s not working out. It took me till 2016 to fully let go of my Oil & Gas dreams even though I knew I was not passionate about it.
  2. Don’t be afraid to be called crazy. I cannot count the number of times people subtly and not-so-subtly told me I was crazy for leaving Chemical Engineering especially when Data Science was relatively new in Nigeria. It used to get to me but now I smile and say to myself “When I blow, you’ll understand”.
  3. Read. Read. Read. The books that opened up this field to me had nothing to do with Data Science. Reading expands your realm of possibilities.
  4. Love to learn. Have learning goals every year and stick to a medium (books/audio/video/classroom) that works best for you.
  5. Always, always put your best foot forward. Let the work that you put out there be the very best work it could be. It would speak for you. 99% of the opportunities I have gotten today came, in part, because of my blog.
  6. Most importantly, you are not an island. Have a tight-knit support system that would tell you the truth even when it hurts. You’d be better for it.

Good luck 🙂

I want to especially thank my amazing support system and all the people that got me here. They are too numerous to mention but I love you guys so much. I want to especially thank Tobi, Didun and Miracle for the support, the tough love, the brutal feedback and telling me where exactly to put an apostrophe. You have been there from day 1. You know all my struggles. You saw me at the very beginning and still believed I could do it. Thank you for making me a better Data Scientist and a better person. I wouldn’t trade you for the world.

 


Part Two on Consistent Billionaires: Introducing The Surprise

In my last post, I spoke about a certain surprise I had to share so here it is……

*drum rolls*

*roll sleeves*

*cracks a knuckle or two*

It’s a web app called Billion Dollar Questions!

billionaireapp

It’s a simple and fun web app that anyone can use to predict what sort of billionaire they’ll become. Simply tell the app who you are and a model runs its magic and tells you your future billionaire status. You can share your prediction on Twitter and Facebook to rack up cool points (if you are going to be Consistent anyway).

At this point, I think I should say that I can in no way guarantee you’d become a billionaire. My skills are in Data Science, not in making money rain.

Here’s how to use it

Before you go any further, I highly recommend that you read my last post. That way, a lot of the stuff on the app would be familiar to you.

Using the app is pretty simple: fill in the form in a way that best describes you, click “Predict” and in a few seconds, the app would tell you what sort of billionaire you’d become. Here’s a GIF on how it works:

Hustler

You can also use it on your desktop, tablet or mobile device!

Now that you’ve seen how it works, here’s the app: 

theartandscienceofdata.shinyapps.io/billiondollarquestions/

Interested in How I did it?

My work is divided into two parts and can be found on my GitHub repo here:

R’s Shiny

Shiny is an amazing tool from RStudio that gives you the ability to create R-driven web apps which can be easily deployed for anyone to use without ever having to touch code. A Shiny app usually has three parts:

  1. The UI: This is what you see at the front-end made up of R-wrapped HTML, JS and CSS.
  2. The Server side: This is basically your usual R code. All R calculations, functions and scripts are run server side. In my case, this is where the model takes all your inputs, converts them to a dataframe and carries out predictions.
  3. Global: This is optional and it is used to declare variables globally which are to be accessed by multiple objects/functions. It is advisable to only use this when necessary because such variables or objects are loaded at runtime and if they are large or take too much time, it can slow down the loading time of your app. In my case, I read in the original dataframe here as well as the model which I used for the app since both objects would be needed by multiple functions.

My UI, server and global variables are all in the app.R file in the GitHub repo shared above.

Some Custom HTML and CSS

Shiny provides a great way for Data Scientists to code up nice web apps without having to know how to use Front-End tools like HTML, CSS and JavaScript. However, if you want more control over your app, you just might need to know a thing or two on how to use those Front-End tools. The good news is, Shiny lets you create these things pretty easily. I wrote custom code using the HTML() function and within it, I can put in my custom HTML exactly the way I would if I was building a website. I also had a custom stylesheet called style.css to give my app the feel I wanted and make it mobile friendly with a few media queries. I also used the famous animate.css library to make my app look fun (you can see all the buttons jiggling away).

Things to Keep in Mind

A number of people asked me why I used h2o and not R’s famous caret for my machine learning. The answer is: it was the use case. The billionaire data had a significant amount of missing values and had variables with over 50 different categories. These two things are what most machine learning algorithm implementations in R don’t deal well with and h2o handles both gracefully. You can check out h2o’s implementation approach here.

The code that I used to create the final model used in the app, along with some interesting research which I did not introduce at this time, can be found here.

 

A Data Driven Guide to Becoming a Consistent Billionaire

Did You Really Think All Billionaires Were the Same?

Recently, I became a bit obsessed with the one percent of the one percent – Billionaires. I was intrigued when I stumbled on articles telling us who and what billionaires really are. The articles said stuff like: Most entrepreneurs do not have a degree and the average billionaire was in their 30s before starting their business. I felt like this was a bit of a generalization and I’ll explain. Let’s take a look at Bill Gates and Hajime Satomi, the CEO of Sega. Both are billionaires but are they really the same? In the past decade, Bill Gates has been a billionaire every single year while Hajime has dropped off the Forbes’ list three times. Is it fair to put these two individuals in the same box, post nice articles and give nice stats when no one wants to be a Hajime? I think not – especially when, in this decade alone, inconsistent billionaires like Hajime make up over 50% of the total billionaire population. Addressing the differences between billionaires is what this post is about. We are going to highlight interesting facts about the consistent billionaires and ultimately, find out what separates the consistent billionaires from the rest.

Just what do I mean by consistent billionaires? Well, that’s what we’re here for. 🙂

For the Nerds Like Me, Here’s How I did It

  • Data Sources: Most of the data was scraped from 3000 Forbes profiles. Two extra variables were collected from a research paper: The Billionaire Characteristics Database. Billionaires covered are those who are or have been billionaires between 2007 and June, 2017.
  • Data Gathering: Using names of billionaires I created their Forbes profile URLs and collected the data I needed using RSelenium and rvest. I’ll be frank. It was not sexy at all. I did a lot of Excel VLOOKUPS, manual inspections and string manipulation to get a workable data set.
  • Data Cleaning: I created columns from strings using stringr.

The code can be found here.
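For readers who want a feel for the URL-building step described under Data Gathering, here is a minimal sketch. It is written in Python purely for illustration and is not the R code from the repo. It assumes Forbes profile pages follow the pattern https://www.forbes.com/profile/first-last/; the function name profile_url and the base URL constant are hypothetical, and real names with unusual spellings would probably still need the manual inspection mentioned above.

```python
import re
import unicodedata

# Assumed URL pattern for Forbes profile pages (not confirmed by the post).
FORBES_PROFILE_BASE = "https://www.forbes.com/profile/"


def profile_url(name):
    """Turn a billionaire's name into a guessed Forbes profile URL."""
    # Strip accents so that, for example, "José" becomes "Jose".
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
    return f"{FORBES_PROFILE_BASE}{slug}/"


# Hypothetical usage: build one URL per name before scraping.
urls = [profile_url(name) for name in ["Bill Gates", "Hajime Satomi"]]
```

Each URL would then be handed to the scraper, and any that fail to resolve flagged for manual checking.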

Just How Many Types of Billionaires Are There?

Here’s what I came up with:

  • The Consistent: These, as the name implies, are individuals who have consistently been billionaires year in and year out. It also includes billionaires that have been away from the list for at most a year (e.g. Mark Zuckerberg in 2008). They should have been billionaires before 2015.
  • The Ghosts: These are billionaires who left the list and have not returned in the past four years. They also should have made their debut before 2015.
  • The Hustlers: This category includes every other billionaire who made their debut before 2015. I.e.
    • Those that left more than once and made a comeback each time.
    • Those who, although made it back to the list, spent more than a year away.
    • Those who are yet to come back but have not spent up to 4 years off the list.
  • The Newbies: These are billionaires that made their debut between 2015 and 2017. They are in a group of their own because I believe it would be unfair to put them in anywhere else as there isn’t enough data to classify them in any other category. Nonetheless, I think it would be interesting to see what they’re up to.
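The four definitions above can be sketched as a small rule-based function. This is a Python illustration of one possible reading of the rules, not the R code from the repo. It assumes each billionaire is described by the set of years (2007 to 2017) in which they appeared on the list, that the debut is the first year in that set, and that anyone currently off the list for fewer than four years counts as a Hustler (per the third Hustler bullet). The function name classify and the constant LAST_YEAR are hypothetical.

```python
LAST_YEAR = 2017  # most recent list covered by the data


def classify(years_on_list):
    """Assign a category from the years a person appeared on the list.

    Assumes a non-empty collection of years between 2007 and 2017.
    """
    years = set(years_on_list)
    debut, last = min(years), max(years)

    # Newbies: debut between 2015 and 2017.
    if debut >= 2015:
        return "Newbie"

    # Ghosts: left the list and have not returned in the past four years.
    years_since_exit = LAST_YEAR - last
    if years_since_exit >= 4:
        return "Ghost"

    # Measure each spell away from the list between debut and last appearance.
    gaps, run = [], 0
    for year in range(debut, last + 1):
        if year in years:
            if run:
                gaps.append(run)
            run = 0
        else:
            run += 1

    # Consistent: still on the list, with at most one absence of at most a year.
    if years_since_exit == 0 and len(gaps) <= 1 and all(g <= 1 for g in gaps):
        return "Consistent"

    # Hustlers: everyone else who debuted before 2015.
    return "Hustler"
```

For example, someone on the list every year except 2009 comes out as Consistent, while someone who left twice and came back each time comes out as a Hustler.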

So, let’s get to it!

Did You Know That?

The Consistent billionaires are well-educated.

Close to 55% of the Consistent billionaires have at least one degree.

Billionaire education

In fact, the Consistent billionaires have the most people with a Bachelor’s, PhD, Masters and pretty much every other degree.

The average Consistent billionaire started their businesses at an age seven years older than the average Ghost.

This applies to billionaires who are self-made and started a business. The average Consistent billionaire starts their business in their 30s, which agrees with the articles saying the average billionaire was in their 30s before starting their business.

Age at Start

Does the Ghost billionaire starting his/her business at least two years earlier than everyone else say something about younger entrepreneurs being less likely to sustain their wealth? Probably. However, if you look at the Newbies, they mostly started out young too. The question is: Will the average Newbie end up a Ghost or has the playing field changed in the past few years?  We can answer that in a few years. 🙂

The top three sectors that produce the highest percentage of Consistent billionaires are Telecoms, Fashion and Diversified portfolios.

Consistent Sectors

Looks extremely mainstream, right? But Fashion? Really?

Note: Fashion and Retail here does not mean Retail in general. It means businesses retailing fashion merchandise, like Zara, H&M etc.

African billionaires are the most likely to be Consistent billionaires

Close to 70% of African billionaires are Consistent – more than any other region in the world. The region that comes closest is North America with 53%.

Consistent Region

In the Newbie Era, however, Asia seems to be dominating every other region and this number is mostly driven by China. In fact, over 50% of Chinese billionaires joined the list during this period.

On the other hand, Middle Eastern billionaires are the most likely to be Ghosts. I know what you’re thinking. Oil prices, right? Probably. However, most Middle Eastern billionaires have diversified portfolios.

There are more billionaires with a PhD than there are drop outs.

This is my favorite.

This applies to all other degrees like MBA, MSc etc. Only professional degrees like Law or Medicine have fewer billionaires than drop outs. However, in the Newbie and Hustler categories, there are even more people with a professional degree than there are drop outs.

Billionaire Degree

11% of Consistent billionaires are female.

Female Billionaires

The only category with a more encouraging female-to-male ratio is the Newbie category, where about 16 percent are female. However, given that the global male-to-female ratio is 50:50, the Newbie category is still 34 percentage points short. The good news is things are getting better. A woman has been close to two times more likely to become a billionaire since 2015 than before that.

64% of Consistent billionaires are self-made.

Self Made Billionaires

The only category with a lower percentage is The Ghost. The good news (or bad news – depending on where you hope your wealth would come from) is that the Newbie billionaire has a higher percentage than that. This means that in recent times, more “new” wealth is being generated. Also, it seems being self-made isn’t a peculiar thing seeing as each category has over 60% of their billionaires being self-made.

Cool, Now What?

The billionaires we all know and love are well-educated and frankly, generally boring.

How much does this matter if you want to become a Consistent billionaire?

To answer that, we will do a bit of Machine Learning (bear with me here, it might get a little technical). Using the h2o.ai machine learning package (which I love!), we would train models to predict what category a billionaire will fall into. We would do this for all the categories except The Newbie because, unlike the others, all that distinguishes this group is when they joined the list and not their performance while on it. We would also use truly independent variables to train our models. For example, a variable that was used to create the categories, like the number of times they left the list, won’t be used. It would be like knowing the answer and working backward if we use variables like that, right? We would then check which variables were the best in predicting a billionaire’s category to answer our question. The code is also available in the same script shared above.

I would first use the purrr and h2o packages to find the best algorithm between Gradient Boosting Machines, Random Forest, and Deep Learning.

Models

Looks like the accuracy of the GBM algorithm on the test set beats the other machine learning algorithms.
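The comparison itself follows a simple pattern: fit each candidate on the training set, score it on the held-out test set, and keep the one with the best accuracy. Below is a minimal Python sketch of that pattern only. It is not the purrr and h2o code from the repo, and to stay self-contained it swaps in two toy classifiers (a majority-class baseline and a one-variable lookup rule) in place of GBM, Random Forest and Deep Learning. All names and the tiny data set are hypothetical placeholders, not results.

```python
from collections import Counter, defaultdict


def accuracy(predict, rows):
    """Share of rows whose predicted label matches the true label."""
    return sum(predict(features) == label for features, label in rows) / len(rows)


def fit_majority(train):
    """Toy baseline: always predict the most common training label."""
    top = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: top


def fit_one_rule(key):
    """Toy model: predict the most common label seen for one variable's value."""
    def fit(train):
        fallback = Counter(label for _, label in train).most_common(1)[0][0]
        by_value = defaultdict(Counter)
        for features, label in train:
            by_value[features[key]][label] += 1
        table = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        return lambda features: table.get(features[key], fallback)
    return fit


def pick_best(candidates, train, test):
    """Fit every candidate on train, score on test, return the winner and all scores."""
    scores = {name: accuracy(fit(train), test) for name, fit in candidates.items()}
    return max(scores, key=scores.get), scores


# Placeholder data: one made-up variable "f" and made-up labels.
train = [({"f": "a"}, "pos")] * 3 + [({"f": "b"}, "neg")] * 2
test_rows = [({"f": "a"}, "pos"), ({"f": "b"}, "neg"),
             ({"f": "b"}, "neg"), ({"f": "a"}, "pos")]

candidates = {"majority": fit_majority, "one_rule": fit_one_rule("f")}
best, scores = pick_best(candidates, train, test_rows)
```

Keeping the candidates in a dictionary of fitting functions mirrors the idea of mapping one training routine over a list of algorithms, which is roughly what purrr is used for here.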

Let’s check what variables GBM considers most important in predicting a billionaire’s category.

Variable importance

We see three variables above 50% relative importance: Country, Sector and the founding year of the company that got them their wealth.

What does this tell us about Consistent billionaires? For one, it says that while the Consistent may be well educated, that’s certainly not what got them there. It’s not shocking that Country and Sector are important variables but “founding_year” is intriguing. It could mean that it may be getting easier or harder to build a sustainable business.

Again, pretty straightforward and boring. Be in an enabling environment at the right time for the sector you play in and BOOM! You make sustainable wealth. At this point, I feel I am obligated to say that 84% of technology billionaires are in North America and Asia. There are currently none from Africa (See sentence above about an enabling environment for your sector) but then again, you can be the pioneer so take my advice with a bag of salt. Good luck!

Things to Keep in Mind

  • The data was gotten from Forbes. This means that I am inherently constrained by their methods, estimates, and errors. For example, the data says there is only one billionaire from Politics. I’d rather diezani than believe that’s true.
  • At the end of the day, I ended up with over 30 variables and I cannot talk about all of them in one post, so here are some visualizations for you to play around and find out for yourself how to become a Consistent billionaire. 😉
  • Want to find out who the Consistent billionaires are? Find out using the full data set here.
  • In my next post, I am going to address what sectors, countries and founding years are the best in becoming a consistent billionaire and;
  • I have a LITTLE surprise. 🙂

What Twitter feels about Network Providers in Nigeria

Disclaimer: This post is a personal effort and is not in any way an advertorial for any party involved. It does not reflect the views of my current, past or future employers.

You know that amazing feeling when your network provider gives you great call rates, cheap data bundles that last, amazing network quality and awesome customer service? No? Yeah, me neither. If you are like me and most people I know, you are probably in a love-hate relationship with your network provider. It’s safe to say that all network providers are frustrating. However, some are more frustrating than others.

The question is: Can we determine which network provider is not going to be that frustrating? Based on the sort of phone I use, can I say if I am going to prefer Etisalat to Airtel?

Predicting the English Premier League Standings

Before I begin this post, I would like to point out that I am the most disgruntled Arsenal fan you’d ever meet. Whatever subliminal messages or shade I could be throwing to your team, it’s (mostly) not to hurt you. We Arsenal fans have to find joy in other places seeing as there’s a good chance we might not make Top 4 this season. Take solace knowing that all I say, I say for the love of the game. Happy reading!

Football is a beautiful sport. The adrenaline rush we get from watching our team score in injury time or the embarrassment we feel when our team not only loses the match, but decides to concede 8 goals in the process (Yes, I am looking at you Arsenal) is part of what makes us addicted to the game. What makes football, especially in England, even more interesting is the uncertainty. You can get all the top players from all the top leagues in Europe or even sign a world class manager who almost won his country the World Cup and still not finish Top Four.