Data Science is the domain of study that uses computational and statistical methods to extract knowledge and insights from data. It combines techniques from mathematics, statistics, computer science and domain-specific knowledge to analyse large datasets, find trends and patterns in the data and make predictions about the future.


Data Science is different from other data related fields because it is not only about collecting and organising data. The data science process consists of analysing, modelling, visualizing and evaluating the data set. Data Science uses tools like machine learning algorithms, data visualisation tools and statistical models to analyse data, make predictions and find patterns and trends in the data.


Other data-related fields such as machine learning, data engineering and data analytics each focus on one part of this work. A machine learning engineer designs and builds algorithms that can learn from data and make predictions. A data engineer designs and manages data pipelines, infrastructure and databases. A data analyst explores and analyses data to find patterns and trends. Data science spans all of these: collecting, exploring, modelling, visualizing and predicting, through to deploying the model.


Overall, data science is a more comprehensive way to analyse data because it includes the whole process, from preparing the data to making predictions. Other fields that deal with data have more specific areas of expertise.

Q2. What is the data science process and what are the key steps involved?


The data science process, also known as the data science lifecycle, is a systematic approach to solving a data problem. It lays out the steps taken to develop, deliver, and maintain a data science project.


A standard data science lifecycle applies machine learning algorithms and statistical procedures to build more accurate prediction models. Data extraction, preparation, cleaning, modelling and evaluation are among its most important stages. The key steps in the data science process are −

Identifying Problem and Understanding the Business


The data science lifecycle starts with "why?" just like any other business lifecycle. One of the most important parts of the data science process is figuring out what the problem is. This gives a clear goal around which all the other steps can be organised. In short, it’s important to know the business goal as early as possible because it determines the end goal of the analysis.

Data Collection

The next step in the data science lifecycle is data collection, which means getting raw data from appropriate and reliable sources. The collected data can be either structured or unstructured. It may come from website logs, social media, online data repositories, data streamed from online sources through APIs, web scraping, or files such as Excel spreadsheets.

Data Processing


After collecting high-quality data from reliable sources, the next step is to process it. The purpose of data processing is to ensure that any problems with the acquired data, such as missing or inconsistent values, have been resolved before proceeding to the next phase. Without this step, the analysis may produce mistakes or inaccurate findings.
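
As a rough illustration of what resolving problems in acquired data can look like, the sketch below cleans a handful of raw records. The field names (`name`, `amount`), the sample rows and the cleaning rules are all assumptions made for this example; real projects choose their own rules.

```python
def clean_rows(raw_rows):
    """Normalise names, coerce amounts to float, and drop unusable rows."""
    cleaned = []
    for row in raw_rows:
        # Hypothetical rule: trim whitespace and use title case for names.
        name = (row.get("name") or "").strip().title()
        try:
            amount = float(row.get("amount"))
        except (TypeError, ValueError):
            continue  # missing or non-numeric amount: skip the row
        if name:  # rows without a name are also skipped
            cleaned.append({"name": name, "amount": amount})
    return cleaned


raw = [
    {"name": " alice ", "amount": "10.5"},
    {"name": "BOB", "amount": None},      # missing value
    {"name": "", "amount": "3"},          # missing name
    {"name": "carol", "amount": "n/a"},   # not a number
    {"name": "dave", "amount": 7},
]
cleaned = clean_rows(raw)
print(cleaned)
```

Dropping bad rows is only one option; depending on the problem, imputing missing values may be the better choice.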

Data Analysis

Exploratory Data Analysis (EDA) is a set of largely visual techniques for analysing data. It gives us a statistical summary of the data, and it lets us deal with duplicate records and outliers and identify trends or patterns within the dataset.
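
The non-visual part of EDA can be sketched with the standard library alone. The example below assumes rows of the form `(name, amount)` and uses the common 1.5 × IQR rule to flag outliers; both the data and the choice of rule are illustrative assumptions.

```python
import statistics


def explore(rows):
    """Summarise (name, amount) rows: duplicates, basic statistics, IQR outliers."""
    deduped = list(dict.fromkeys(rows))  # drops exact duplicate rows, keeps order
    amounts = [amount for _, amount in deduped]
    q1, median, q3 = statistics.quantiles(amounts, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "rows": len(rows),
        "duplicates": len(rows) - len(deduped),
        "mean": statistics.mean(amounts),
        "median": median,
        "outliers": [row for row in deduped if not low <= row[1] <= high],
    }


rows = [("a", 10), ("b", 12), ("c", 11), ("b", 12),
        ("d", 13), ("e", 95), ("f", 9), ("g", 12)]
summary = explore(rows)
print(summary)
```

Here the repeated `("b", 12)` row is counted as a duplicate and `("e", 95)` is flagged as an outlier.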

Data Visualization


Data visualisation is the process of representing information and data graphically. Data visualisation tools make it easy to understand trends, outliers, and patterns in data by using visual elements like charts, graphs, and maps. It’s also a good way for employees or business owners to present data to people who aren’t tech-savvy without confusing them.

Data Modelling


Data Modelling is one of the most important aspects of data science and is sometimes referred to as the core of data analysis. The intended output of a model should be derived from prepared and analysed data.


At this phase, we develop datasets for training and testing the model for production-related tasks. It also involves selecting the correct model type and determining whether the problem involves classification, regression, or clustering. After choosing the model type, we must choose the appropriate algorithms to implement it. This must be done with care, as it is crucial for extracting the relevant insights from the provided data.
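
Developing separate training and testing datasets is usually done by shuffling the rows and holding some of them back. The helper below is a minimal sketch; the function name, the 25% test fraction and the fixed seed are assumptions chosen for the example.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model is fitted on `train` only, so its score on `test` estimates how it will behave on data it has not seen.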

Model Deployment

Model deployment involves establishing the delivery method needed to put the model in front of end users or another system. Machine learning models are also increasingly being deployed on devices. Depending on the complexity of the project, this stage can range from a basic model output on a Tableau dashboard to a complex cloud-based deployment with millions of users.

Q3. What is the difference between supervised and unsupervised learning?

Supervised learning − Supervised learning, also called supervised machine learning, is a branch of machine learning and artificial intelligence. It is defined by its use of labelled datasets to train algorithms to classify data or predict outcomes correctly. As data is fed into the model, the model adjusts its weights until it fits the data appropriately, which is checked as part of the cross-validation process. Supervised learning helps organisations solve a wide range of real-world problems at scale, such as filtering spam into a separate folder from the inbox, as Gmail does.

Supervised Learning Algorithms − Naive Bayes, Linear regression, Logistic regression.

Unsupervised learning − Unsupervised learning, also called unsupervised machine learning, uses machine learning algorithms to analyse and group unlabelled datasets. These algorithms find hidden patterns or groupings in the data. Their ability to find similarities and differences in information makes them well suited to exploratory data analysis, cross-selling strategies, customer segmentation, and image recognition.

Unsupervised Learning Algorithms − K-means clustering

Q4. What is regularization and how does it help to avoid overfitting?


Regularization is a method that adds information to a model, usually as a penalty term in the loss function, to stop it from overfitting. In regression, it shrinks the coefficient estimates towards zero, which makes the model effectively smaller. In this context, reducing a model’s capacity means penalising large or unnecessary weights.


Regularization shrinks excess weight on individual features and tends to spread weight more evenly across them; it does not force all weights to be equal. This makes it harder to learn a model that is overly flexible and has a lot of moving parts. A highly flexible model is one that can fit almost every data point, including the noise.
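
To make the shrinking effect concrete, here is a minimal sketch that assumes an L2 (ridge) penalty, which is only one of several regularization techniques. For a one-feature model y ≈ w·x with no intercept, minimising Σ(y − w·x)² + λ·w² has the closed-form solution w = Σxy / (Σx² + λ). The data and the λ values are made up for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate of w in y ≈ w * x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)  # lam = 0 gives ordinary least squares


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# A larger penalty pulls the weight further towards zero.
weights = [ridge_slope(xs, ys, lam) for lam in (0.0, 1.0, 10.0, 100.0)]
for lam, w in zip((0.0, 1.0, 10.0, 100.0), weights):
    print(f"lambda={lam:>5}: w={w:.4f}")
```

The weight never reaches exactly zero here; it just gets smaller as λ grows, which is the capacity reduction described above.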

Q5. What is cross-validation and why is it important in machine learning?


Cross-validation is a technique for testing ML models by training them on some subsets of the available input data and then evaluating them on the held-out subset. We can use cross-validation to detect overfitting, that is, a failure to generalise a pattern.

For cross-validation, we can use the k-fold cross-validation method. In k-fold cross-validation, we divide the initial data into k groups (also known as folds). We train an ML model on all but one of the folds (k-1 of them), and then we test the model on the fold that wasn’t used for training. This process is repeated k times, and each time a different fold is set aside for evaluation (and not used for training).
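
The fold bookkeeping described above can be sketched as a small generator. The function name `k_fold_indices` is hypothetical, and the sketch splits indices in order without shuffling, which real pipelines often add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample lands in the test fold exactly once, so averaging the k test scores uses all of the data for evaluation.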

Q6. What is the difference between classification and regression in machine learning?


The major difference between regression and classification is that regression predicts a continuous quantity, while classification predicts discrete class labels. The two kinds of algorithm also overlap in some ways.


A regression algorithm can also predict a discrete value, but only as an integer quantity, such as a count.


A classification algorithm can also predict a continuous value, but only as the probability of a class label.
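
The difference in output type can be seen side by side in a small sketch. It assumes a least-squares line as the regression model and a nearest-class-mean rule as the classifier; the study-hours data and the `pass`/`fail` labels are invented for the example.

```python
import statistics


def fit_regression(xs, ys):
    """Least-squares line; the returned predictor outputs continuous numbers."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_classifier(xs, labels):
    """Nearest-class-mean rule; the returned predictor outputs discrete labels."""
    means = {}
    for label in set(labels):
        members = [x for x, l in zip(xs, labels) if l == label]
        means[label] = statistics.mean(members)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 73]                        # continuous target
passed = ["fail", "fail", "fail", "pass", "pass", "pass"]  # discrete target

predict_score = fit_regression(hours, scores)
classify = fit_classifier(hours, passed)

print(predict_score(7))  # a number on a continuous scale
print(classify(4.5))     # one of the known class labels
```

The same input feature feeds both models; only the kind of target, and therefore the kind of prediction, changes.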


Clustering is a method for data mining that organises unlabelled data based on their similarities or differences. Clustering techniques are used to organise unclassified, unprocessed data items into groups according to structures or patterns in the data. There are many types of clustering algorithms, including exclusive, overlapping, hierarchical, and probabilistic.

K-means clustering is a popular example of a clustering approach in which data points are allocated to K groups based on their distance from each group’s centroid. The data points closest to a given centroid are grouped into the same cluster. A higher K value produces smaller groups with more granularity, while a lower K value produces bigger groups with less granularity. Common applications of K-means clustering include market segmentation, document clustering, image segmentation, and image compression.
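
The assign-then-update loop behind K-means can be sketched in a few lines. This version makes one simplifying assumption that should be stated loudly: it initialises the centroids with the first K points, whereas practical implementations normally use random or smarter initialisation. The sample points are made up.

```python
import math


def k_means(points, k, max_iter=100):
    """Plain K-means on tuples of coordinates; returns (labels, centroids)."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
    return labels, centroids


points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (0.5, 1.5), (9.5, 8.0)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

With these points the two groups are well separated, so the loop settles after a couple of passes.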

Q8. What is gradient descent and how does it work in machine learning?

Gradient descent is an optimisation algorithm that is often used to train neural networks and other machine learning models. Training data helps these models learn over time, and the cost function acts as a barometer, measuring the model’s error after each parameter update. At each iteration the model moves its parameters a small step in the direction that reduces the cost, and it keeps doing so until the cost function is at or near its minimum (ideally close to zero). Once machine learning models are tuned to be as accurate as possible, they can be used in artificial intelligence (AI) and computer science in powerful ways.
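
A minimal sketch of the update loop, assuming a linear model y ≈ w·x + b and mean squared error as the cost function. The learning rate, step count and data are illustrative choices, not recommendations.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y ≈ w * x + b by repeatedly stepping against the MSE gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

If the learning rate is too large the updates overshoot and diverge; if it is too small the loop needs many more steps to get close to the minimum.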

Q9. What is A/B testing and how can it be used in data science?

A/B testing is a common form of randomised controlled experiment. It is a method for determining which of two versions of a variable performs better in a controlled setting. A/B testing is one of the most important concepts in data science and the technology industry as a whole, since it is one of the most efficient ways to draw conclusions about a hypothesis. It is a common method for evaluating products and is gaining momentum in data analytics. A/B testing works best for incremental changes such as UX modifications, new features, ranking, and page load speeds.
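
One common way to decide whether version B really beats version A is a two-proportion z-test on conversion rates. The sketch below assumes that test and uses invented visitor and conversion counts; other tests may suit other metrics better.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under "no difference"
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 2000 visitors per variant.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

A small p-value suggests the observed difference is unlikely to be due to chance alone; the significance threshold should be fixed before the experiment is run.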

Q10. Can you explain overfitting and underfitting, and how to mitigate them?


Overfitting is a modelling error that arises when a function is fitted too closely to a limited set of data points. It is typically the outcome of a model that is too complex for the amount of training data available, so it learns the noise as well as the underlying pattern.


Underfitting is a modelling error that arises when a function does not properly match the data points. It is typically the outcome of a model that is too simple, or that has had inadequate training, to capture the underlying pattern.


There are a number of ways that researchers in machine learning can avoid overfitting. These include: Cross-validation, Regularization, Pruning, Dropout.


There are a number of ways that researchers in machine learning can avoid underfitting. These include −

  1. Get more training data.

  2. Add more parameters or features.

  3. Make the model more complex.

  4. Train for longer, until the cost function stops decreasing.


With these methods, you should be able to make your models better and fix any problems with overfitting or underfitting.