This article is aimed at anyone interested in the practical application of machine learning and anyone thinking about starting a project of their own. If you don’t fall into one of those categories, you may not get much out of this writing.
The goal is to provide a chronological checklist of all the steps involved in taking on such a project, along with a detailed explanation of each stage.
Starting a machine learning project can be an exciting yet difficult task, and without a plan you are basically asking to fail, no matter how good you are. Developing a successful project takes a lot of expertise and learning to properly manage all of the processes covered in this article. However, by following a few rules, we can better structure our project and save ourselves a lot of maintenance and trouble.
With that said, let’s start.
Identify
The first and perhaps most important of the 8 steps is identifying your problem. You want a clear picture of your idea and of what you would need to do to solve it. Look for a project with a useful application that solves a real-life problem, and one that is actually doable. Some examples include disease detection, movie and song recommenders, object detection, and flower classification; these are just a few interesting projects that have been done in the past.
Data Collection
After coming up with a solid project idea, the next most important step is collecting the data. Usually, there is some sort of dataset available for your specific project somewhere online (for example, if you are building a movie recommender you could use the MovieLens dataset). In fact, there are entire websites dedicated to datasets, like Kaggle.
But if there isn’t a dataset available, you will have to collect your own. In the modern world, data is the new oil; if you are able to collect your own data, you are sitting on a hidden gem. Good data is hard to come by for free, and companies are willing to pay millions or even billions for it.
If you are building a flower classifier, you could go to gardens or fields to take pictures of flowers and measure the length of the petals, stems, and sepals.
Or if you’re building a lung cancer detector, you could visit local hospitals and try to collect data from them.
It’s hard to create data, but once you have it you can import it into your program via an Excel spreadsheet or some other file and actually start programming.
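As a rough sketch of that import step (the file names here are hypothetical, and I’m assuming the pandas library, which is just one common choice), loading a dataset can be as short as:

```python
import pandas as pd

# Hypothetical file name -- swap in your own dataset.
df = pd.read_csv("flowers.csv")       # for a plain CSV file
# df = pd.read_excel("flowers.xlsx")  # or an Excel spreadsheet

print(df.head())   # peek at the first five rows
print(df.shape)    # (number of rows, number of columns)
```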
Data Cleaning
The process of ensuring that data is correct, consistent, and useful is known as data cleaning. You can clean data by detecting mistakes or corruptions, repairing or removing them, or manually processing data as needed to avoid recurrences. Sometimes when you are given data (especially free data) there are empty columns and/or outlying values. Empty rows or columns can be easily deleted from the dataset, but outlying values are a little harder to deal with.
Outliers are values that sit far above or below everything else in your data. They can mislead the model and dilute an otherwise accurate prediction.
For example, let’s say a basketball player is going to join the Los Angeles Lakers and wants to build an algorithm to predict how many points he is going to score. An outlier in the Lakers data would be Kobe Bryant (rest in peace). He was so much better than the average player, and our player is likely nowhere near Kobe’s skill level; therefore, Kobe is an outlier.
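One possible way to handle both problems, sketched with pandas (the column name and the 3-standard-deviation cutoff are just illustrative assumptions):

```python
import pandas as pd

df = pd.read_csv("player_stats.csv")  # hypothetical dataset

# Drop rows and columns that are completely empty.
df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")

# Treat anything more than 3 standard deviations from the mean
# of the points column as an outlier (a Kobe) and remove it.
mean, std = df["points"].mean(), df["points"].std()
df = df[(df["points"] - mean).abs() <= 3 * std]
```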
Building a Model
The next most important step is choosing a machine learning model to use (or inventing one) and actually coding it. If you are a beginner you may not know many models, so here is an abbreviated guide to some of the most commonly used ones. If that doesn’t satisfy you, you can check out this article. But before that, it is necessary to talk about the two categories that these models fall under: supervised and unsupervised learning.
Supervised learning involves learning a function that connects an input to an output based on example input-output pairs.
Unlike supervised learning, unsupervised learning is used to draw inferences and find patterns from input data without references to labeled outcomes. Two main methods used in unsupervised learning include clustering and dimensionality reduction.
Alright, now that that’s out of the way, we can break down the machine learning models.
Linear Regression
Linear Regression is the simplest model there is. Say there are two variables, an input and an output. Each input-output pair is plotted as a point, and the algorithm tries to fit a straight line that best corresponds to the data.
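Here’s a minimal sketch of fitting that line, assuming scikit-learn and a tiny made-up dataset (hours studied versus exam score):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: hours studied (input) vs. exam score (output).
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([52, 58, 65, 70, 74])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # slope and intercept of the fitted line
print(model.predict([[6]]))           # predicted score for 6 hours of study
```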
Decision Trees
Decision trees are the easiest to understand: a decision tree is basically a series of simple yes/no questions. Each question is called a node, and the more nodes you have, the more accurate and specific the prediction will be.
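A quick sketch of a decision tree on scikit-learn’s built-in iris flower dataset (the library choice and the depth of 3 are just assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth caps how many questions (nodes) a sample passes through.
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(tree.predict(X[:5]))  # predicted species for the first five flowers
```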
Random Forest
Random Forest is exactly what its name implies: a forest of trees. A random forest trains multiple decision trees and combines their individual predictions, which gives a much broader view of the data and therefore a more accurate prediction.
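The code barely changes from the single-tree sketch above; the choice of 100 trees here is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 decision trees, each trained on a random sample of the data;
# the forest's answer is the majority vote of its trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:5]))
```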
Neural Network
A neural network is a simple framework modeled after the human brain. It allows a program to take in data inputs and run calculations on them to produce an output, just as a brain processes information to produce a response. Each input enters through a neuron, and neurons are connected through a series of synapses (which is where the calculations take place to get from one layer to the next). A simple neural network has three types of layers: the input layer, the hidden layer, and the output layer. When there are multiple hidden layers, it is called deep learning.
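As a small sketch (again assuming scikit-learn; the single hidden layer of 10 neurons is an arbitrary choice), a basic neural network classifier looks like this:

```python
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

# Input layer (4 flower measurements) -> one hidden layer of 10 neurons
# -> output layer (3 flower species).
net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000).fit(X, y)
print(net.predict(X[:5]))
```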
Clustering
Clustering is an unsupervised technique that groups similar data points together. It shows up in anything involving the word “segmentation” and is frequently used for customer segmentation, image segmentation, and file segmentation. Visually, it splits all the data points into distinct colored clusters.
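A toy customer-segmentation sketch with k-means (the made-up columns are yearly spend and visits per month; two clusters is an assumption):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data: [yearly spend, visits per month].
customers = np.array([[200, 1], [220, 2], [800, 9], [760, 8], [210, 1]])

# Group the customers into two segments.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)  # which cluster each customer was assigned to
```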
Logistic Regression
Logistic Regression is similar to Linear Regression, except that instead of a best-fit line with an infinite range of predictions, it fits a best-fit ‘S-curve’ with only two possible outcomes: above one half or below one half.
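A minimal sketch of that S-curve in action, on made-up pass/fail data (scikit-learn assumed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: hours studied vs. pass (1) or fail (0).
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[3.5]]))  # point on the S-curve: probability of fail vs. pass
print(clf.predict([[3.5]]))        # final 0/1 answer after the one-half cutoff
```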
Now, these are some of the biggest tools in machine learning; with these models alone you could do a lot of different types of projects. In case you want to learn more, here are a few other, less common models:
- Support Vector Machine
- Naive Bayes
- Dimensionality Reduction
Then there are more specialized models; for example, if you’re doing an object detection project (which is highly advanced for beginners) you might use a CNN. But for now, you should probably stick to one of the listed models.
Training
Now that you have a model built for your data, you have to actually train it: initially the model will just have random values, and it has to be tailored to your data.
First off, you have to split your data into two categories: training and testing. You can’t test your model on the same data you trained it on, because you will always get near-perfect accuracy. You can split it however you want, 70/30 or 75/25; the exact ratio doesn’t really matter much, as long as the testing data is kept separate.
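Here’s what a 75/25 split looks like with scikit-learn’s train_test_split (the iris dataset is just a stand-in for your own data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold back 25% of the data for testing; the model never sees
# X_test or y_test while it is being trained.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(len(X_train), len(X_test))
```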
Training a model simply means learning good values for all the weights (the calculations, or questions, that are applied to the data). The input data is connected to the output data by some sort of calculation (or question, when dealing with trees and forests). Training tries to find those calculations by measuring how far off the predictions are with the current weights (the error: expected output minus actual output), and then tweaking the weights so that the results come slightly closer.
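Here is a bare-bones sketch of that tweak-and-repeat loop for a single weight, written in plain NumPy (the toy data and learning rate are just assumptions):

```python
import numpy as np

# Toy data whose true relationship is y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w = np.random.rand()   # start with a random weight
learning_rate = 0.01

for step in range(200):
    predictions = w * x
    error = predictions - y              # how far off the current weight is
    gradient = (2 * error * x).mean()    # which direction to nudge the weight
    w -= learning_rate * gradient        # tweak it so results get slightly closer

print(w)  # ends up very close to 2.0
```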
In other words, the random weights are slowly changed based on how inaccurate they are until they are just accurate enough. But how do you know how accurate your model is? The answer is testing.
Testing
You can check how accurate your weights truly are by testing them on the untouched testing data. Because you have already done all you can with the training data, this testing stage is vital to truly evaluate your model. Since you already have the real outputs, you can compare them to your predicted outputs and tweak the model until it is good enough for your liking. There are a lot of factors to consider at this stage, and it’s critical that you define what counts as a “good enough” model, otherwise you’ll end up changing parameters for a long time. The tuning, or adjusting, of these parameters is still a bit of an art; it’s an exploratory process that’s strongly influenced by the details of your dataset.
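Measuring accuracy on that held-out data is usually one line once the split is in place; here’s a self-contained sketch with scikit-learn (the random forest and the iris dataset are stand-ins for your own model and data):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train on the training split only, then score on the untouched test split.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))  # fraction of test rows predicted correctly
```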
Prediction
Congrats you’ve finished your model. You have researched an idea, collected and cleaned the data, built a model, trained it, and tested it to perfection. Now you’re finally able to do what you’ve wanted to do, make a prediction.
You can ask the user for input, send it through the model, and deliver an output. With that said, you can send your predictions to anyone who might want them. For example, I have personally made a tool to predict the outcome of football and basketball games; people who bet on these sports are willing to pay for a good prediction.
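For instance, a prediction step for the flower classifier could be as simple as the sketch below (the trained model here is a stand-in for whatever you built in the earlier steps):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the model you trained and tested earlier.
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Ask the user for the four measurements, e.g. "5.1, 3.5, 1.4, 0.2".
values = input("Sepal length, sepal width, petal length, petal width: ")
features = [[float(v) for v in values.split(",")]]
print("Predicted species:", load_iris().target_names[model.predict(features)[0]])
```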
Once you start building a name for yourself in the industry, your predictions will be gold.
Deployment
The last and final step is to host your model online. You could upload your code to GitHub, or use Flask or Django to turn your Python code into a convenient web app. This way the public can actually benefit from your work, and you will begin to build a name for yourself in the industry.
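A minimal Flask sketch of that idea might look like the following (the saved model file `model.pkl` and the `/predict` route are hypothetical names, not a prescribed layout):

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.pkl")  # hypothetical file saved after training

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [5.1, 3.5, 1.4, 0.2]}.
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": int(prediction)})

if __name__ == "__main__":
    app.run(debug=True)
```

Anyone could then send a POST request with their feature values and get your model’s prediction back as JSON.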
Conclusion
Machine learning is difficult, but this article serves as a basic reference for all of its main aspects. It gives you a strong basis for thinking about diverse machine learning problems and a common vocabulary for each step, so that when you read articles and do more research in the future, you will actually understand the complex ideas they discuss.
Good luck with all your coding endeavors!
Works Cited
“Build and Deploy Your First Machine Learning Web App.” KDnuggets, www.kdnuggets.com/2020/05/build-deploy-machine-learning-web-app.html.
G, Yufeng. “The 7 Steps of Machine Learning.” Medium, Towards Data Science, 7 Sept. 2017, towardsdatascience.com/the-7-steps-of-machine-learning-2877d7e5548e.
Karkare, Prateek. “Structuring Machine Learning Projects.” Medium, AI Graduate, 25 Jan. 2020, medium.com/x8-the-ai-community/structuring-machine-learning-projects-8b49cebbb9d5.
“Machine Learning Glossary | Google Developers.” Google, Google, developers.google.com/machine-learning/glossary#gradient_descent.
Rana, Ryan. “The Simplest Guide to Neural Networks.” By Ryan Rana — The Ryan Rana Publication, The Ryan Rana Publication, 19 July 2021, ryanrana.substack.com/p/the-simplest-guide-to-neural-networks.
Shin, Terence. “All Machine Learning Models Explained in 6 Minutes.” Medium, Towards Data Science, 17 Oct. 2020, towardsdatascience.com/all-machine-learning-models-explained-in-6-minutes-9fe30ff6776a.
Sunil, Ray. “Commonly Used Machine Learning Algorithms: Data Science.” Analytics Vidhya, 23 Dec. 2020, www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/.