Data Science: A Concise Tutorial
Data Science - Machine Learning
Machine learning enables a machine to learn automatically from data, improve its performance with experience, and make predictions without being explicitly programmed. The field is mainly concerned with developing algorithms that allow a computer to learn from data and past experience on its own. The term "machine learning" was introduced by Arthur Samuel in 1959.
Data science is the practice of extracting useful insights from data so that the most important and relevant information can be identified. Given a dependable stream of data, it also uses machine learning to generate predictions.
Data science and machine learning are subfields of computer science that focus on analysing and making use of large amounts of data to improve the processes by which products, services, infrastructure systems, and more are developed and brought to market.
The relationship between the two resembles that between rectangles and squares: every square is a rectangle, but not every rectangle is a square. Data science is the all-encompassing rectangle, while machine learning is the square, a distinct entity within it. Both are commonly used by data scientists in their work and are being adopted by businesses in almost every sector.
What is Machine Learning?
Machine learning (ML) is a family of algorithms that lets software become more accurate at predicting outcomes without being specifically programmed to do so. The basic idea is to build algorithms that take data as input, use statistical analysis to predict an output, and update their predictions as new data becomes available.
Machine learning is a part of artificial intelligence that uses algorithms to find patterns in data and then predict how those patterns will develop in the future. It gives engineers a way to apply statistical analysis to pattern discovery.
Platforms such as Facebook, Twitter, Instagram, YouTube, and TikTok collect information about their users. Based on what you have done in the past, they can infer your interests and needs and recommend products, services, or articles that match them.
Machine learning is a set of tools and concepts used in data science, although they also appear in other fields. Data scientists often use machine learning in their work to extract more information faster or to identify trends.
Types of Machine Learning
Machine learning can be classified into three types of algorithms −
-
Supervised learning
-
Unsupervised learning
-
Reinforcement learning
Supervised Learning
Supervised learning, also called supervised machine learning, is a type of machine learning defined by its use of labelled datasets to train algorithms to classify data or predict outcomes correctly. As data is fed into the model, the model's weights are adjusted until it fits the data well. Cross-validation is typically used alongside training to check that the model neither overfits nor underfits. Supervised learning helps organisations solve a wide range of real-world problems at scale, such as routing spam into a folder separate from the inbox, as Gmail does with its spam folder.
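The idea of adjusting weights until the model fits labelled data can be illustrated with a minimal sketch. It assumes a one-feature linear model trained by batch gradient descent on mean squared error; the function name `fit_line`, the learning rate, and the toy data are illustrative choices, not part of the original tutorial.

```python
def fit_line(xs, ys, lr=0.05, epochs=2000):
    """Fit y ~ w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors for the current weights.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Move the weights a small step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Labelled examples generated from y = 2x + 1 (hypothetical toy data).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

On this toy data the learned slope and intercept should approach 2 and 1. In practice, a held-out set or cross-validation would be used to check how well the fitted model generalises.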
Supervised Learning Algorithms
Some supervised learning algorithms are −
-
Naive Bayes − Naive Bayes is a classification algorithm that applies Bayes’ theorem under the assumption of class-conditional independence. This means that, given the class, the presence of one feature does not change the likelihood of another, and each predictor is treated as having an equal effect on the outcome.
-
Linear Regression − Linear regression is used to find how a dependent variable is related to one or more independent variables and to make predictions about what will happen in the future. Simple linear regression is when there is only one independent variable and one dependent variable.
-
Logistic Regression − Linear regression is used when the dependent variable is continuous. Logistic regression is used when the dependent variable is categorical, such as "true" or "false", or "yes" or "no". Both models seek to describe the relationship between the inputs and the output, but logistic regression is mostly used for binary classification problems, such as deciding whether a particular email is spam.
-
Support Vector Machines (SVM) − A support vector machine is a popular supervised learning model developed by Vladimir Vapnik. It can be used for both classification and regression, but it is most often used for classification, where it constructs a hyperplane that maximises the distance (the margin) between two groups of data points. This hyperplane is called the "decision boundary" because it separates the groups of data points (for example, oranges and apples) on either side of it.
-
K-nearest Neighbour − The k-nearest neighbour (KNN) algorithm classifies a data point according to how close it is to other data points. It works on the idea that similar data points lie close to each other. It therefore computes the distance between data points, usually the Euclidean distance, and assigns the most common category among the nearest neighbours (for classification) or their average value (for regression). However, because every prediction compares the new point with the stored data, processing time grows with the size of the dataset, which makes KNN less practical for large classification tasks.
-
Random Forest − Random forest is another flexible supervised learning algorithm that can be used for both classification and regression. The "forest" is a collection of decision trees that are largely uncorrelated with one another. Their predictions are combined to reduce variance and produce more accurate results.
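Of the algorithms listed above, k-nearest neighbour is the simplest to sketch in a few lines. The example below is a minimal illustration using only the Python standard library; the function name `knn_predict`, the choice of k = 3, and the apple/orange toy points are assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each Euclidean distance with its label, then sort by distance only.
    distances = sorted(
        ((math.dist(point, query), label)
         for point, label in zip(train_points, train_labels)),
        key=lambda pair: pair[0],
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k neighbours wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-feature measurements for two kinds of fruit.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
labels = ["apple", "apple", "apple", "orange", "orange", "orange"]

print(knn_predict(points, labels, (1.2, 1.4)))  # prints "apple"
print(knn_predict(points, labels, (6.8, 6.9)))  # prints "orange"
```

Each call scans every stored point, which is why prediction time grows with the amount of stored data.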
Unsupervised Learning
Unsupervised learning, also called unsupervised machine learning, uses machine learning algorithms to analyse and group unlabelled datasets. These algorithms discover hidden patterns or groupings in the data. Their ability to find similarities and differences in information makes them well suited to exploratory data analysis, cross-selling strategies, customer segmentation, and image recognition.
Common Unsupervised Learning Approaches
Unsupervised learning models are used for three main tasks: clustering, association, and dimensionality reduction. Common methods and algorithms are described below −
Clustering − Clustering is a data mining technique that groups unlabelled data according to similarities or differences. It organises unclassified, raw data items into groups based on structures or patterns in the data. Clustering algorithms fall into several types, including exclusive, overlapping, hierarchical, and probabilistic.
K-means Clustering is a popular example of a clustering approach in which data points are assigned to one of K groups according to their distance from each group’s centroid. The data points closest to a given centroid are placed in the same cluster. A larger value of K produces smaller, more granular groups, while a smaller value of K produces larger, less granular groups. Common applications of K-means clustering include market segmentation, document clustering, image segmentation, and image compression.
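The assignment of points to the nearest centroid can be sketched as follows. This is a minimal version of the standard iterative procedure (often called Lloyd's algorithm), assuming fixed starting centroids and a fixed number of iterations; the function name `kmeans` and the toy points are illustrative only.

```python
import math


def kmeans(points, initial_centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, assignments


# Two visibly separated groups of hypothetical 2-D points, so K = 2.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, assignments = kmeans(points, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
print(assignments)  # prints [0, 0, 0, 1, 1, 1]
print(centroids)
```

Real implementations usually choose the starting centroids randomly or with a seeding heuristic and stop when assignments no longer change; both are omitted here for brevity.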
Dimensionality Reduction − Although more data typically produces more accurate results, a large number of features can also degrade the performance of machine learning algorithms (for example, through overfitting) and make datasets difficult to visualise. Dimensionality reduction is a technique used when a dataset has too many features, or dimensions. It reduces the number of data inputs to a manageable level while preserving the integrity of the dataset as far as possible. It is often applied during data pre-processing, and several methods exist, one of which is −
Principal Component Analysis (PCA) − PCA is a dimensionality reduction method that removes redundancy and compresses datasets through feature extraction. It applies a linear transformation to produce a new representation of the data, expressed as a set of "principal components". The first principal component is the direction in the dataset along which variance is greatest. The second principal component captures the greatest remaining variance while being uncorrelated with the first, which makes its direction orthogonal to the first. The procedure is repeated up to the number of dimensions, with each subsequent principal component being the direction of greatest remaining variance that is orthogonal to all preceding components.
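To make the notion of a "direction of greatest variance" concrete, the sketch below estimates only the first principal component. It assumes the usual formulation in which principal components are eigenvectors of the covariance matrix, and it uses power iteration rather than a full eigendecomposition to stay within the standard library; the function name and the toy data are illustrative.

```python
import math


def first_principal_component(rows, iterations=200):
    """Estimate the direction of greatest variance by power iteration."""
    n, d = len(rows), len(rows[0])
    # Centre the data so variance is measured around the mean.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Repeatedly multiply a vector by the covariance matrix and renormalise;
    # it converges towards the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance of the data along the estimated direction.
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    return v, variance


# Hypothetical 2-D data lying close to the line y = x.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
direction, variance = first_principal_component(data)
print(direction, variance)
```

For this data the direction should come out close to the diagonal, and the variance along it should account for almost all of the total variance, which is what allows the two features to be compressed into one. Library implementations normally compute all components at once with a singular value decomposition.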
Reinforcement Learning
Reinforcement Learning (RL) is a type of machine learning in which an agent learns in an interactive environment by trial and error, using feedback from its own actions and experiences.
Key terms in Reinforcement Learning
Some key concepts that describe the basic components of an RL problem are −
-
Environment − The world, physical or simulated, in which the agent operates
-
State − The current situation of the agent
-
Reward − Feedback from the environment
-
Policy − A mapping from the agent’s states to actions
-
Value − The future reward an agent would receive for taking an action in a given state.
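The terms above can be seen working together in a small sketch. It assumes tabular Q-learning, one common RL method that the tutorial does not name, on a hypothetical five-cell corridor in which the agent starts at the left end and receives a reward of 1 for reaching the right end. All names and parameter values are illustrative.

```python
import random


def train_corridor_agent(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                         epsilon=0.2, seed=0):
    """Tabular Q-learning on a corridor; reaching the last cell gives reward 1."""
    rng = random.Random(seed)
    actions = (-1, +1)  # move left or move right
    # Value estimates for every (state, action) pair, initially zero.
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        while state != n_states - 1:
            # Policy: mostly pick the best-known action, sometimes explore.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            # Environment: apply the action, yielding the next state and reward.
            next_state = min(max(state + actions[a], 0), n_states - 1)
            reward = 1.0 if next_state == n_states - 1 else 0.0
            # Value update: move the estimate towards
            # reward + discounted best future value.
            target = reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q_table = train_corridor_agent()
greedy_policy = ["left" if left > right else "right" for left, right in q_table[:-1]]
print(greedy_policy)
```

After training, the greedy policy read from the table should choose "right" in every non-terminal cell, since that is the only route to the reward.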
Data Science vs Machine Learning
Data science is the study of data and of how to derive meaningful insights from it, while machine learning is the study and development of models that use data to improve performance or inform predictions. Machine learning is a subfield of artificial intelligence.
In recent years, machine learning and artificial intelligence (AI) have come to dominate parts of data science, playing a crucial role in data analytics and business intelligence. Machine learning automates data analysis: models and algorithms are applied to large volumes of collected data about particular populations in order to make predictions. Data science and machine learning are related, but they are not identical.
Data science is a broad field that covers every aspect of deriving insights and information from data. It involves gathering, cleaning, analysing, and interpreting large amounts of data to discover patterns, trends, and insights that can guide business decisions.
Machine learning is a subfield of data science that focuses on developing algorithms that can learn from data and make predictions or decisions based on what they have learned. Machine learning algorithms are designed to improve their performance automatically over time as they receive new data.
In other words, data science includes machine learning as one of its many methods. Machine learning is a powerful tool for data analysis and prediction, but it is only one part of data science as a whole.
The table below compares the two.
| Data Science | Machine Learning |
| --- | --- |
| Data science is a broad field that involves extracting insights and knowledge from large and complex datasets using various techniques, including statistical analysis, machine learning, and data visualization. | Machine learning is a subset of data science that involves defining and developing algorithms and models that enable machines to learn from data and make predictions or decisions without being explicitly programmed. |
| Data science focuses on understanding the data, identifying patterns and trends, and extracting insights to support decision-making. | Machine learning focuses on building predictive models and making decisions based on the learned patterns. |
| Data science includes a wide range of techniques, such as data cleaning, data integration, data exploration, statistical analysis, data visualization, and machine learning. | Machine learning primarily focuses on building predictive models using algorithms such as regression, classification, and clustering. |
| Data science typically works with large and complex datasets that need significant processing and cleaning before insights can be derived. | Machine learning requires data suitable for training algorithms and models; supervised methods in particular require labelled data. |
| Data science requires skills in statistics, programming, and data visualization, as well as domain knowledge in the area being studied. | Machine learning requires a strong understanding of algorithms, programming, and mathematics, as well as knowledge of the specific application area. |
| Data science techniques can be used for a variety of purposes beyond prediction, such as clustering, anomaly detection, and data visualization. | Machine learning algorithms are primarily focused on making predictions or decisions based on data. |
| Data science often relies on statistical methods to analyse data. | Machine learning relies on algorithms to make predictions or decisions. |