Python Data Science: A Concise Tutorial

Data Science Python - Getting Started

What is Data Science?

Data science is the process of deriving knowledge and insights from huge and diverse data sets by organizing, processing and analysing the data. It draws on many disciplines, such as mathematical and statistical modelling, extracting data from its source and applying data visualization techniques. It often also involves working with big data technologies to gather both structured and unstructured data. Below are some example scenarios where data science is used.

Recommendation Systems

As online shopping becomes more prevalent, e-commerce platforms are able to capture users' shopping preferences as well as the performance of various products in the market. This has led to the creation of recommendation systems, which build models that predict shoppers' needs and show the products each shopper is most likely to buy.

Financial Risk Management

Financial risk involving loans and credit is better analysed by using customers' past spending habits, past defaults, other financial commitments and many socio-economic indicators. This data is gathered from various sources in different formats. Organizing it and gaining insight into customer profiles requires data science. The outcome is lower losses for the financial organization through avoiding bad debt.

Improvement in Health Care Services

The health care industry deals with a variety of data, which can be classified into technical data, financial data, patient information, drug information and legal rules. All this data needs to be analysed in a coordinated manner to produce insights that save costs for both the health care provider and the care receiver while remaining legally compliant.

Computer Vision

Advances in image recognition by computers involve processing large sets of image data covering many objects of the same category, for example in face recognition. These data sets are modelled, and algorithms are created to apply the model to newer images and obtain a satisfactory result. Processing these huge data sets and creating the models requires various tools used in data science.

Efficient Management of Energy

As the demand for energy soars, energy-producing companies need to manage the various phases of energy production and distribution more efficiently. This involves optimizing production methods and storage and distribution mechanisms, as well as studying customers' consumption patterns. Linking the data from all these sources and deriving insight from it can seem a daunting task. The tools of data science make it easier.

Python in Data Science

The programming requirements of data science demand a versatile and flexible language that is simple to write yet can handle highly complex mathematical processing. Python is well suited to these requirements, as it has already established itself as a language for both general-purpose and scientific computing. Moreover, it is continuously being extended through new additions to its plethora of libraries aimed at different programming requirements. Below we discuss the features of Python that make it a preferred language for data science.

  1. It is a simple, easy-to-learn language that often achieves results in fewer lines of code than similar languages such as R. Its simplicity also makes it robust enough to handle complex scenarios with minimal code and much less confusion about the general flow of the program.

  2. It is cross-platform, so the same code works in multiple environments without any changes. That makes it well suited to multi-environment setups.

  3. For many data analysis workloads it runs faster than comparable languages such as R and MATLAB, although this depends on the task and the libraries used.

  4. Its memory management, especially garbage collection, lets it gracefully handle transforming, slicing, dicing and visualizing very large volumes of data.

  5. Most importantly, Python has a very large collection of libraries that serve as special-purpose analysis tools. For example, the NumPy package handles scientific computing, and its array needs much less memory than a conventional Python list for managing numeric data. The number of such packages is continuously growing.

  6. Python has packages that can directly use code written in other languages such as Java or C. This helps optimize performance by reusing existing code from those languages whenever it gives a better result.
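
The memory claim about NumPy arrays in point 5 can be checked with a rough sketch like the one below. It assumes NumPy is installed and a typical 64-bit CPython build. The size n and the variable names are illustrative choices, not part of the tutorial. The list figure is only an estimate, since it adds up the list container and every integer object it refers to.

```python
import sys

import numpy as np

n = 100_000  # arbitrary illustrative size (assumption)

# A plain Python list stores pointers to separate int objects.
values = list(range(n))
# Rough estimate: the list container plus every int object it points to.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# A NumPy array stores the raw 64-bit integers in one contiguous buffer.
arr = np.arange(n, dtype=np.int64)
array_bytes = arr.nbytes  # data buffer only, excluding the small array header

print(f"list  : about {list_bytes:,} bytes")
print(f"array : {array_bytes:,} bytes")
print(f"ratio : about {list_bytes / array_bytes:.1f}x")
```

The exact ratio varies with the Python version and the chosen dtype. A smaller dtype such as np.int32 would shrink the array further, at the cost of a narrower value range.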

In the subsequent chapters we will see how to leverage these features of Python to accomplish the tasks needed in the different areas of data science.