So you’ve decided to jump into the world of Machine Learning & Data Science? A.I. is out there too, but it sits in a far corner, and you will need to cover a lot of ground before you get there. To get your journey started, this blog covers the groundwork: every prerequisite that will help you build the foundation you’ll probably need before you move on to the big guns.
What is the difference between Machine Learning & Data Science?
Machine learning is the study of computer algorithms that improve automatically through experience. It is a subset of artificial intelligence. In terms of code, ML revolves around writing algorithms that learn from input data, identify patterns, make decisions & calculate desired outputs.
Data Science, on the other hand, is broader than many tech enthusiasts assume. It lies at the intersection of Math, Statistics, Artificial Intelligence, Software Engineering and Design Thinking. Data Science deals with data collection, cleaning, analysis, visualization, model creation, model validation, prediction, designing experiments, hypothesis testing and much more, which is why these skills lay the foundation for anyone getting started in this field.
Let’s get to where we’re supposed to be…
To begin, you should have a working knowledge of either Python or R, though Python is more widely recommended. R is well regarded for its visualization & statistical analysis features, while Python is a very popular language that is heavily used for deployment. Python does not lag behind R on analysis either: it has some amazing libraries (covered below) that are widely used for the same purposes.
Work through the following in order to get a grip on the major concepts you’ll need later.
1. NumPy 🔢 : NumPy is a Python library used for working with arrays. It provides multi-dimensional array and matrix data structures, and can be used to perform many mathematical operations on arrays, such as trigonometric, statistical, and algebraic ones.
2. Pandas 🐼 : Pandas is mainly used for data analysis. This library is built on top of NumPy. Pandas can import data from various sources such as comma-separated values (CSV) files, JSON, SQL and Microsoft Excel. It supports data manipulation operations such as merging, reshaping and selecting, as well as data cleaning and data wrangling.
3. Matplotlib 📊 : Matplotlib is a plotting library primarily used for data visualization. It is also built on top of NumPy. It is used for plotting various graphs such as line plots, bar graphs, pie charts and other figures.
4. Seaborn 📈 : Seaborn is a Python data visualization library built on top of Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Matplotlib covers the basic plots, while Seaborn offers a wider variety of visualization patterns. It needs less code and comes with attractive default themes. It also makes more advanced plots such as heatmaps, box plots, histograms, scatter plots & many more easy to draw.
5. Scikit-learn 💻 : Scikit-learn is a machine learning library for Python which provides ready-made implementations of various algorithms such as support vector machines, random forests, and k-nearest neighbours.
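To make the list above a little more concrete, here is a minimal sketch that touches NumPy, Pandas and Scikit-learn in a few lines. The table of heights and weights, the column names and the "small"/"large" labels are all made up for illustration, and the choice of 3 neighbours is just an assumption, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# NumPy: a 2-D array and a few mathematical operations on it
angles = np.array([[0.0, np.pi / 2], [np.pi, 3 * np.pi / 2]])
sines = np.sin(angles)               # element-wise trigonometric function
column_means = angles.mean(axis=0)   # statistical summary, one value per column

# Pandas: a tiny hand-written table (hypothetical data, illustration only)
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190, 200],
    "weight_kg": [50, 58, 66, 80, 90, 100],
    "label": ["small", "small", "small", "large", "large", "large"],
})
tall = df[df["height_cm"] > 165]     # selecting rows with a boolean condition

# Scikit-learn: a pre-built k-nearest-neighbours classifier
model = KNeighborsClassifier(n_neighbors=3)
model.fit(df[["height_cm", "weight_kg"]], df["label"])

# Classify one new, unseen point
new_point = pd.DataFrame({"height_cm": [185], "weight_kg": [85]})
prediction = model.predict(new_point)[0]
print(prediction)
```

Matplotlib and Seaborn would come in as soon as you want to look at this table as a plot instead of as numbers.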
Once you’ve got a decent idea of the libraries above, you can jump into the next pool.
Exploratory Data Analysis (EDA) 📝
All of the libraries above are used in this phase.
Exploratory Data Analysis refers to analyzing data sets to summarize their main characteristics, often with visual methods. It is a critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, find outliers, look for trends and check assumptions with the help of summary statistics and graphical representations.
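A first pass at EDA can be as small as the sketch below. The data set is invented for illustration, and the 1.5 × IQR rule used to flag outliers is one common convention that I am assuming here, not the only way to do it.

```python
import pandas as pd

# Hypothetical data set, invented for illustration
df = pd.DataFrame({
    "age": [23, 25, 31, 35, 29, 27, 95],
    "income": [32000, 35000, 48000, None, 41000, 39000, 45000],
})

# Summary statistics (count, mean, std, min, quartiles, max) per column
summary = df.describe()

# Number of missing values in each column
missing = df.isna().sum()

# Flag outliers in "age" with the 1.5 * IQR rule
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(summary)
print(missing)
print(outliers)
```

On real data you would usually follow this up with plots (histograms, box plots, scatter plots) to check visually what the numbers suggest.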
Probability ⏳
Probability plays a foundational role in Machine Learning. It is used to judge how likely outcomes are, perform hypothesis testing & categorize data on the basis of their mathematical distributions.
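As a small illustration of these ideas, the sketch below estimates a probability by simulation and then recovers the parameters of a normal distribution from a sample. The dice example, the seed, the sample size and the chosen mean and standard deviation are all assumptions picked for the demo.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Estimate P(sum of two dice == 7) by simulation; the exact value is 6/36
rolls = rng.integers(1, 7, size=(100_000, 2))   # upper bound 7 is exclusive
estimate = np.mean(rolls.sum(axis=1) == 7)

# Draw from a normal distribution, then recover its parameters from the sample
sample = rng.normal(loc=10.0, scale=2.0, size=100_000)
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)                 # ddof=1 gives the sample standard deviation

print(f"estimated P(sum == 7): {estimate:.4f} (exact: {1 / 6:.4f})")
print(f"sample mean: {sample_mean:.3f}, sample std: {sample_std:.3f}")
```

The estimates will not match the exact values digit for digit, but they should land close to them, and they get closer as the sample size grows.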
I’ve attached a detailed GitHub repository containing explanations of everything discussed above. Each notebook in the repo covers the most commonly used functions/practices required when assessing data sets.
https://github.com/Dhruv-VINT/ML-foundations 💡