Data Science Tutorial

Data Science - Interview Questions

Below are some of the most commonly asked questions in data science interviews.

Q1. What is Data Science and how is it different from other data-related fields?

Data Science is the domain of study that uses computational and statistical methods to extract knowledge and insights from data. It draws on techniques from mathematics, statistics, computer science, and domain-specific knowledge to analyse large datasets, find trends and patterns in the data, and make predictions about the future.

Data Science is different from other data-related fields because it is not only about collecting and organising data. The data science process consists of analysing, modelling, visualising, and evaluating the dataset. Data science uses tools such as machine learning algorithms, data visualisation tools, and statistical models to analyse data, make predictions, and find patterns and trends in the data.

Other data-related fields such as machine learning, data engineering, and data analytics are more narrowly focused. The goal of a machine learning engineer is to design and build algorithms that can learn from data and make predictions, the goal of data engineering is to design and manage data pipelines, infrastructure, and databases, and data analytics is all about exploring and analysing data to find patterns and trends. Data science, by contrast, covers collecting, exploring, modelling, visualising, predicting, and deploying the model.

Overall, data science is a more comprehensive way to analyse data because it includes the whole process, from preparing the data to making predictions. Other fields that deal with data have more specific areas of expertise.

Q2. What is the data science process and what are the key steps involved?

A data science process, also known as the data science lifecycle, is a systematic approach to finding a solution for a data problem; it lays out the steps taken to develop, deliver, and maintain a data science project.

A standard data science lifecycle approach comprises the use of machine learning algorithms and statistical procedures that result in more accurate prediction models. Data extraction, preparation, cleaning, modelling, and assessment are some of the most important data science stages. The key steps involved in the data science process are −

Identifying the Problem and Understanding the Business

The data science lifecycle starts with "why?", just like any other business lifecycle. One of the most important parts of the data science process is figuring out what the problem is. This helps to set a clear goal around which all the other steps can be planned. In short, it is important to know the business goal as early as possible, because it determines what the end goal of the analysis will be.

Data Collection

The next step in the data science lifecycle is data collection, which means getting raw data from appropriate and reliable sources. The collected data can be either organised or unorganised. It may come from website logs, social media, online data repositories, data streamed from online sources using APIs or web scraping, or data sitting in Excel or any other source.
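As a minimal sketch of this step in Python, data collection often boils down to loading files or calling web APIs with libraries such as pandas and requests; the file name and API endpoint below are hypothetical placeholders, not real sources.

```python
import pandas as pd
import requests

# Organised data from a local file; "sales_log.csv" is a placeholder name.
sales = pd.read_csv("sales_log.csv")

# Semi-structured data pulled from a (hypothetical) JSON API endpoint.
response = requests.get("https://example.com/api/website-logs", timeout=10)
logs = pd.DataFrame(response.json())

print(sales.shape, logs.shape)
```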

Data Processing

After collecting high-quality data from reliable sources, the next step is to process it. The purpose of data processing is to ensure that any problems with the acquired data have been resolved before moving on to the next phase. Without this step, we may produce mistakes or inaccurate findings.
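A minimal processing sketch with pandas, continuing with the placeholder file from the collection step; the column names ("price", "customer_id") are assumptions for illustration.

```python
import pandas as pd

df = pd.read_csv("sales_log.csv")                          # placeholder file name

df = df.drop_duplicates()                                  # remove duplicate rows
df["price"] = pd.to_numeric(df["price"], errors="coerce")  # coerce bad entries to NaN
df["price"] = df["price"].fillna(df["price"].median())     # impute missing prices
df = df.dropna(subset=["customer_id"])                     # drop rows missing a required key
```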

Data Analysis

Exploratory Data Analysis (EDA) is a set of visual techniques for analysing data. With this approach, we can get specific details about the statistical summary of the data. We can also deal with duplicate values and outliers, and identify trends or patterns within the dataset.
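A short EDA sketch with pandas, again using the placeholder dataset from the previous steps; it prints the statistical summary, duplicate and missing-value counts, and pairwise correlations.

```python
import pandas as pd

df = pd.read_csv("sales_log.csv")      # placeholder file from the collection step

print(df.describe())                   # statistical summary of numeric columns
print(df.duplicated().sum())           # number of duplicate rows
print(df.isna().sum())                 # missing values per column
print(df.corr(numeric_only=True))      # pairwise correlations to spot trends
```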

Data Visualization

Data visualisation is the process of presenting information and data graphically. By using visual elements such as charts, graphs, and maps, data visualisation tools make it easy to spot trends, outliers, and patterns in data. It is also a great way for employees or business owners to present data to people who are not tech-savvy, without confusing them.
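A minimal plotting sketch with pandas and matplotlib; the "price" and "date" columns are hypothetical, and any numeric and date columns from your own data would work the same way.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales_log.csv")          # placeholder file from the earlier steps

# Histogram to inspect a distribution and spot outliers.
df["price"].plot(kind="hist", bins=30, title="Price distribution")
plt.xlabel("price")
plt.show()

# Line chart of a metric over time to reveal trends.
df.groupby("date")["price"].mean().plot(title="Average price over time")
plt.ylabel("average price")
plt.show()
```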

Data Modelling

Data modelling is one of the most important aspects of data science and is sometimes referred to as the core of data analysis. The intended output of a model should be derived from data that has been prepared and analysed.

At this phase, we develop datasets for training and testing the model for production-related tasks. This also involves selecting the correct model type and determining whether the problem calls for classification, regression, or clustering. Once the model type is settled, we must choose an appropriate algorithm to implement it. This must be done with care, as it is crucial to extract the relevant insights from the data provided.
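A minimal modelling sketch with scikit-learn; synthetic data stands in for the prepared dataset, and a decision tree is just one possible choice of algorithm for a classification problem.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic data stands in for the prepared and analysed dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Develop separate datasets for training and testing the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The problem here is classification, so a classifier is chosen; a regression or
# clustering problem would call for a different type of model.
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```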

Model Deployment

Model deployment covers setting up the delivery method needed to put the model in front of market consumers or another system. Machine learning models are also increasingly being deployed on devices and are gaining acceptance and appeal. Depending on the complexity of the project, this stage might range from a basic model output on a Tableau dashboard to a complicated cloud-based deployment serving millions of users.
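One common deployment pattern (a sketch, not the only option) is to serialise the trained model and serve predictions behind a small web API, for example with joblib and Flask; the model file, route, and payload format here are assumptions.

```python
import joblib
from flask import Flask, request, jsonify

# Assumes the trained model from the modelling step was saved beforehand with
# joblib.dump(model, "model.joblib").
app = Flask(__name__)
model = joblib.load("model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]      # e.g. a list of numeric features
    prediction = model.predict([features])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(port=5000)
```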

Q3. What is the difference between supervised and unsupervised learning?

Supervised Learning − Supervised learning is a type of machine learning and artificial intelligence, also called "supervised machine learning". It is defined by its use of labelled datasets to train algorithms to classify data or predict outcomes correctly. As data is fed into the model, the model's weights are adjusted until the model fits correctly; this is part of the cross-validation process. Supervised learning helps organisations find large-scale solutions to a wide range of real-world problems, such as classifying spam into a folder separate from the inbox, as Gmail does with its spam folder.

Supervised Learning Algorithms − Naive Bayes, Linear regression, Logistic regression.
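A minimal supervised-learning sketch using one of the algorithms named above (Naive Bayes) on scikit-learn's built-in iris dataset, where every sample comes with a class label.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Labelled dataset: feature matrix X and known class labels y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = GaussianNB()
clf.fit(X_train, y_train)              # learn from labelled examples

print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```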

Unsupervised Learning − Unsupervised learning, also called unsupervised machine learning, uses machine learning algorithms to analyse unlabelled datasets and group them. These algorithms find hidden patterns or groupings in the data. The ability to find similarities and differences in information makes unsupervised learning well suited to exploratory data analysis, cross-selling strategies, customer segmentation, and image recognition.

Unsupervised Learning Algorithms − K-means clustering

Q4. What is regularization and how does it help to avoid overfitting?

Regularization is a method that adds information to a model to stop it from becoming overfitted. It is a form of regression that shrinks the coefficient estimates towards zero in order to make the model smaller. In this context, reducing a model's capacity means taking away extra weights.

Regularization removes extra weight from the chosen features and spreads the remaining weight more evenly across them. This means that regularisation makes it harder to learn a model that is overly flexible and has a lot of moving parts. A highly flexible model is one that can fit as many data points as possible.
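A short sketch contrasting an unregularised linear model with a ridge (L2-regularised) model on noisy synthetic data; the larger the alpha penalty, the more the coefficient estimates are shrunk towards zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))          # few samples, many features: easy to overfit
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=50)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)    # L2 penalty shrinks the weights

print("sum of |coefficients| without regularisation:", np.abs(plain.coef_).sum())
print("sum of |coefficients| with ridge penalty    :", np.abs(ridge.coef_).sum())
```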

Q5. What is cross-validation and why is it important in machine learning?

Cross-validation is a technique for testing ML models by training them on different subsets of the available input data and testing them on the remaining subset. We can use cross-validation to detect overfitting, i.e., a failure to generalise a pattern.

For cross-validation, we can use the k-fold cross-validation method. In k-fold cross-validation, we divide the initial data into k groups (also known as folds). We train the ML model on all but one (k-1) of the subsets and then test it on the subset that was not used for training. This process is repeated k times, and each time a different subset is set aside for evaluation (and not used for training).
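A minimal k-fold cross-validation sketch with scikit-learn, using k = 5 folds on the built-in iris dataset; each fold is held out once for testing while the other four are used for training.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5 folds: the model is trained k times, each time evaluated on a different held-out fold.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)

print("per-fold accuracy:", scores)
print("mean accuracy    :", scores.mean())
```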

Q6. What is the difference between classification and regression in machine learning?

The major difference between regression and classification is that regression helps predict a continuous quantity, while classification helps predict discrete class labels. The two kinds of machine learning algorithms also share some components.

A regression algorithm can also predict a discrete value, provided it is in the form of an integer quantity.

Likewise, a classification algorithm can predict a value that is in the form of a class label probability.
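A small sketch contrasting the two on synthetic data: the regressor returns continuous quantities, while the classifier returns discrete class labels and, via predict_proba, class label probabilities.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: the target is a continuous quantity.
Xr, yr = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print("regression output   :", reg.predict(Xr[:3]))        # continuous values

# Classification: the target is a discrete class label.
Xc, yc = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc, yc)
print("predicted labels    :", clf.predict(Xc[:3]))         # e.g. 0 or 1
print("label probabilities :", clf.predict_proba(Xc[:3]))   # continuous probabilities
```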

Q7. What is clustering and how does K-means clustering work?

Clustering is a data mining method that organises unlabelled data based on its similarities or differences. Clustering techniques are used to group unclassified, unprocessed data items according to structures or patterns in the data. There are many types of clustering algorithms, including exclusive, overlapping, hierarchical, and probabilistic.

K-means clustering is a popular example of a clustering approach in which data points are allocated to K groups based on their distance from each group's centroid. The data points closest to a given centroid are grouped into the same category. A higher value of K indicates smaller groups with more granularity, while a lower value of K indicates bigger groups with less granularity. Common applications of K-means clustering include market segmentation, document clustering, image segmentation, and image compression.
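A minimal K-means sketch with scikit-learn: points are assigned to the cluster whose centroid they are closest to, with K chosen up front (here K = 3 on synthetic blob data).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabelled data: three blobs of points, with no class labels used.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)         # cluster index assigned to every point

print("cluster centroids:\n", kmeans.cluster_centers_)
print("first 10 assignments:", labels[:10])
```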

Q8. What is gradient descent and how does it work in machine learning?

Gradient descent is an optimisation algorithm that is often used to train neural networks and machine learning models. Training data helps these models learn over time, and the cost function in gradient descent acts as a barometer, measuring how accurate the model is with each iteration of parameter updates. The model keeps adjusting its parameters to make the error as small as possible, until the cost function is close to or equal to zero. Once machine learning models are tuned to be as accurate as possible, they can be used in powerful ways in artificial intelligence (AI) and computer science.
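A from-scratch sketch of gradient descent fitting a one-variable linear model y ≈ w*x + b: at each iteration the parameters are stepped against the gradient of the mean-squared-error cost, so the error keeps shrinking.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=100)   # true w = 2.5, b = 1.0, plus noise

w, b = 0.0, 0.0
learning_rate = 0.01

for _ in range(2000):
    y_pred = w * x + b
    error = y_pred - y
    cost = (error ** 2).mean()          # mean squared error cost function
    grad_w = 2 * (error * x).mean()     # d(cost)/dw
    grad_b = 2 * error.mean()           # d(cost)/db
    w -= learning_rate * grad_w         # step against the gradient
    b -= learning_rate * grad_b

print(f"learned w = {w:.2f}, b = {b:.2f}, final cost = {cost:.4f}")
```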

Q9. What is A/B testing and how can it be used in data science?

A/B testing is a common form of randomised controlled experiment. It is a method for determining which of two versions of a variable performs better in a controlled setting. A/B testing is one of the most important concepts in data science and in the technology industry as a whole, since it is one of the most efficient ways to draw conclusions about any hypothesis. It is essential to understand what A/B testing is and how it normally works. A/B testing is a common method for evaluating products and is gaining momentum in the field of data analytics. It is most effective when testing incremental changes such as UX modifications, new features, ranking, and page load speeds.
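A minimal A/B-test sketch comparing the conversion rates of version A and version B with a two-proportion z-test from statsmodels; the visitor and conversion counts below are made-up illustration numbers.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical experiment results: conversions and visitors for each version.
conversions = [320, 370]   # version A, version B
visitors = [4000, 4000]

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")

# A small p-value (commonly < 0.05) suggests the difference between A and B
# is unlikely to be due to chance alone.
if p_value < 0.05:
    print("The two versions have significantly different conversion rates.")
```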

Q10. Can you explain overfitting and underfitting, and how to mitigate them?

Overfitting is a modelling error that arises when a function is fitted too closely to a limited number of data points. It is the outcome of an overly complex model trained on too few data points.

Underfitting is a modelling error that arises when a function does not properly fit the data points. It is the outcome of a model that is too simple or has been trained inadequately.

There are a number of ways that researchers in machine learning can avoid overfitting. These include: Cross-validation, Regularization, Pruning, Dropout.

There are a number of ways that researchers in machine learning can avoid underfitting. These include −

  1. Get more training data.

  2. Add more parameters or increase the size of the parameters.

  3. Make the model more complex.

  4. Train for longer, until the cost function reaches its minimum.

With these methods, you should be able to make your models better and fix any problems with overfitting or underfitting.
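A short sketch illustrating both failure modes on noisy data: a degree-1 polynomial underfits, a degree-15 polynomial tends to overfit, and cross-validation (one of the remedies listed above) exposes the gap between training and held-out performance.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(0, 3, size=(40, 1))
y = np.sin(2 * X).ravel() + rng.normal(scale=0.2, size=40)   # noisy non-linear target

for degree in (1, 4, 15):              # too simple, reasonable, very complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_r2 = model.fit(X, y).score(X, y)                   # fit quality on the training data
    cv_r2 = cross_val_score(model, X, y, cv=5).mean()        # quality on held-out folds
    print(f"degree {degree:2d}: train R^2 = {train_r2:.2f}, CV R^2 = {cv_r2:.2f}")
```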