How to Get Started in Machine Learning in High School
Yeah, machine learning is cool, but before that comes the data.
When most people hear of Artificial Intelligence (AI), their thoughts go to books like “I, Robot” (Isaac Asimov) or movies such as “Terminator 3: Rise of the Machines”. However, that couldn’t be further from the reality – at least today’s reality.
I, myself, am a huge Asimov fan. I remember reading his books when I was nine or ten, and even today I catch myself rereading some of his stories. I remember dreaming of programming robots back then. Today, as a 17-year-old girl studying at a mathematics high school, I have found what I love. It may not be robots, but it is close. Here, I'll discuss how to get started in machine learning.
What is Machine Learning?
Machine learning is a method of data analysis that automates the building of analytical models. It is based on the idea that algorithms can learn from data, identify patterns, and improve their performance with every iteration.
ML is all around you. For example, the Netflix series recommended to you are the result of an algorithm that analyzes everything you’ve watched and suggests similar shows and movies based on it. Such algorithms might even use external data from users with similar profiles (with interests close to yours). The same goes for your Instagram “Suggested” section, which tracks every post you spend more time on and every profile you click, and shapes your page in a way you find intriguing so that you spend more time on the app. The conglomerate’s goal is accomplished, right?
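To make "improving with every iteration" concrete, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions: the toy data, the function names, and the choice of gradient descent on a straight line are not from any particular course or library.

```python
# Toy data that follows y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def mean_squared_error(w, b):
    """Average squared gap between the predictions w*x + b and the true ys."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(steps=2000, learning_rate=0.05):
    """Fit w and b with plain gradient descent, recording the error at each step."""
    w, b = 0.0, 0.0
    n = len(xs)
    history = [mean_squared_error(w, b)]
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move both parameters a small step against their gradients.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        history.append(mean_squared_error(w, b))
    return w, b, history


w, b, history = train()
print(f"w = {w:.3f}, b = {b:.3f}")
print(f"error before training: {history[0]:.3f}, after: {history[-1]:.6f}")
```

The algorithm is never told that the rule is "multiply by 2 and add 1". It starts from a bad guess and the recorded error shrinks as it keeps adjusting the two numbers.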
How can I get started in Machine Learning?
There are many ways to get started in machine learning, and the Inspirit AI pre-collegiate program is one of them. There you get a crash course on Python and the basic libraries used in ML, followed by the fundamentals and a brief introduction to the main methods, from linear and logistic regression to various types of neural nets, all guided by a mentor who has graduated from an Ivy League university.
Another way I would recommend to kick off your AI journey is to start a free ML course (e.g. Andrew Ng's Stanford course is a great option).
What is the importance of data in ML?
However, even if you can write your own algorithms, they have to iterate over something: a dataset. You can't get started in machine learning without one. There are some prepared, ready-to-use datasets, such as MNIST, which you have probably heard of. In case you haven’t, it contains images of handwritten digits and can be used (in the simplest case) to train an algorithm to learn the specifics of each digit and tell them apart. However, in most cases, you have to prepare your dataset on your own before it can be used. Depending on the data needed, your “data_loader.py” file may vary from 100 to more than 300 lines of code.
PRE-PROCESSING YOUR DATA
The processing needed before training starts with loading the dataset using a library of your choice. One option is pandas, which supports tables and .csv files and makes steps such as rescaling your data easier. Depending on the type of dataset you use (a table with labels, a time-series dataset, and so on), some steps may be added or dropped.
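Here is a rough sketch of what loading and rescaling a table with pandas could look like. The column names and values are invented, and the CSV is read from an in-memory string so the example runs on its own. With a real file you would pass its path to pd.read_csv instead. Min-max rescaling is just one common choice, not the only one.

```python
import io

import pandas as pd

# A small made-up table standing in for a real .csv file on disk.
csv_text = """height_cm,weight_kg,label
170,65,a
182,80,b
158,52,a
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # the first rows of the table
print(df.dtypes)   # the column types pandas inferred
print(df.shape)    # (number of rows, number of columns)

# Min-max rescaling: squeeze each numeric column into the range [0, 1].
numeric = df[["height_cm", "weight_kg"]]
scaled = (numeric - numeric.min()) / (numeric.max() - numeric.min())
print(scaled)
```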
Another issue that will probably occur with a bigger dataset is missing values. It’s important to handle them, as many ML algorithms do not support them. There are several solutions, such as deleting the rows with the corrupted data, writing algorithms that predict the missing data beforehand, or imputing missing values with the mean or median. It’s up to you to decide.
Also, sometimes there are values that lie far from the median and the mean, called outliers, which often need to be removed. Many other preprocessing steps may be needed as well.
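Two of these options, deleting rows and imputing with the mean or median, can be sketched with pandas in a few lines. The table below is made up for illustration, and which option fits best depends on your data.

```python
import numpy as np
import pandas as pd

# Made-up table with two missing entries (np.nan marks a missing value).
df = pd.DataFrame({
    "age": [17.0, np.nan, 16.0, 18.0],
    "score": [90.0, 85.0, np.nan, 70.0],
})

# Count the missing values in each column before deciding what to do.
print(df.isna().sum())

# Option 1: delete every row that contains a missing value.
dropped = df.dropna()

# Option 2: fill each gap with the mean of its column.
mean_filled = df.fillna(df.mean())

# Option 3: fill each gap with the median of its column.
median_filled = df.fillna(df.median())

print(dropped)
print(mean_filled)
print(median_filled)
```

Deleting rows is the simplest choice but throws data away, which hurts on a small dataset. Imputing keeps every row at the cost of inventing plausible values.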
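One common way to spot outliers is the interquartile range (IQR) rule, sketched below with pandas. The 1.5 multiplier is a widely used convention that I am assuming here, not something every dataset calls for, and the numbers are invented.

```python
import pandas as pd

# Made-up measurements with one value far away from the rest.
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

# The interquartile range is the spread of the middle half of the data.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything more than 1.5 * IQR outside the quartiles is flagged as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Keep only the values that were not flagged.
cleaned = values[~is_outlier]

print(values[is_outlier])
print(cleaned)
```

Before deleting anything, it is worth checking whether a flagged value is a genuine error or a rare but real observation.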
POPULAR ML LIBRARIES
There are some cool libraries used in ML, such as PyTorch and TensorFlow. As a user of PyTorch, I’ve found it helpful not only for the algorithms but also for preprocessing the data, as it provides Dataset and DataLoader classes for loading it.
As a last word, do not underestimate the importance of preprocessing and optimizing your dataset, as without it your algorithm can’t produce much. Datasets are very important for getting started in machine learning.
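As a minimal sketch of how those two classes fit together, the example below wraps a tiny made-up table in a custom Dataset and reads it back in batches with a DataLoader. The class name TableDataset and the data are hypothetical. A real “data_loader.py” would usually also read files and preprocess them.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class TableDataset(Dataset):
    """Hypothetical dataset that wraps feature rows and their labels."""

    def __init__(self, features, labels):
        self.features = torch.tensor(features, dtype=torch.float32)
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        # The DataLoader uses this to know how many samples exist.
        return len(self.labels)

    def __getitem__(self, index):
        # Return one (features, label) pair for the requested index.
        return self.features[index], self.labels[index]


# Made-up data: five samples with two features each.
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
labels = [0, 1, 0, 1, 0]

dataset = TableDataset(features, labels)

# The DataLoader groups samples into batches and can shuffle them each epoch.
loader = DataLoader(dataset, batch_size=2, shuffle=True)

for batch_features, batch_labels in loader:
    print(batch_features.shape, batch_labels.shape)
```

A Dataset only needs to say how many samples it has and how to fetch one of them. Batching and shuffling are left to the DataLoader.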