Data Science 简明教程
Data Science - Prerequisites
你需要具备多种技术和非技术技能才能成为一名成功的数据科学家。具备某些技能对于成为一名知识渊博的数据科学家至关重要,而另一些技能只是为了让数据科学家的事情变得更容易。不同的工作角色决定了你必须具备的特定技能熟练程度。
You need to have several technical and non-technical skills to become a successful Data Scientist. Some of the skills are essential to have to become a well-versed data scientist while some for just for making thing things easier for a data scientist. Different job roles determine the level of skill-specific proficiency you need to possess.
以下列出了一些成为数据科学家所需具备的技能。
Given below are some skills you will require to become a data scientist.
Technical Skills
Python
数据科学家大量使用 Python,因为它是最流行的编程语言之一,易于学习,并拥有可用于数据操作和数据分析的大型库。因为它是一种灵活的语言,所以它可以在数据科学的所有阶段中使用,例如数据挖掘或运行应用程序。Python 拥有一个庞大的开源库,其中包含强大的数据科学库,如 Numpy、Pandas、Matplotlib、PyTorch、Keras、Scikit Learn、Seaborn 等。这些库有助于完成不同的数据科学任务,例如读取大型数据集,绘制和可视化数据和相关性,训练和拟合机器学习模型以适应你的数据,评估模型的性能等。
Data Scientists use Python a lot because it is one of the most popular programming languages, easy to learn and has extensive libraries that can be used for data manipulation and data analysis. Since it is a flexible language, it can be used in all stages of Data Science, such as data mining or running applications. Python has a huge open-source library with powerful Data Science libraries like Numpy, Pandas, Matplotlib, PyTorch, Keras, Scikit Learn, Seaborn, etc. These libraries help with different Data Science tasks, such as reading large datasets, plotting and visualizing data and correlations, training and fitting machine learning models to your data, evaluating the performance of the model, etc.
SQL
SQL 是在开始数据科学之前需要的另一个基本条件。与其他编程语言相比,SQL 相对简单,但要成为一名数据科学家是必需的。此编程语言用于管理和查询关系数据库存储的数据。我们可以使用 SQL 检索、插入、更新和删除数据。要从数据中提取见解,能够创建复杂的 SQL 查询(包括连接、分组、具有等)至关重要。连接方法使你能够同时查询多个表格。SQL 还可以执行分析操作和转换数据库结构。
SQL is an additional essential prerequisite before getting started with Data Science. SQL is relatively simple compared to other programming languages, but is required to become a Data Scientist. This programming language is used to manage and query relational database-stored data. We can retrieve, insert, update, and remove data with SQL. To extract insights from data, it is crucial to be able to create complicated SQL queries that include joins, group by, having, etc. The join method enables you to query many tables simultaneously. SQL also enables the execution of analytical operations and the transformation of database structures.
R
R 是一种高级语言,用于制作复杂的统计模型。R 还允许你使用阵列、矩阵和向量。R 以其图形库而闻名,使用户能够绘制精美的图表并使图表易于理解。
R is an advanced language that is used to make complex models of statistics. R also lets you work with arrays, matrices, and vectors. R is well-known for its graphical libraries, which let users draw beautiful graphs and make them easy to understand.
借助 R Shiny,程序员可以使用 R 制作 Web 应用程序,用于将可视化元素嵌入到网页中,并为用户提供大量与其交互的方式。此外,数据提取是数据科学的一个关键部分。R 允许你将 R 代码连接到数据库管理系统。
With R Shiny, programmers can make web applications using R, which is used to embed visualizations in web pages and gives users a lot of ways to interact with them. Also, data extraction is a key part of the science of data. R lets you connect your R code to database management systems.
R 还为你提供了更高级数据分析的多种选择,例如构建预测模型、机器学习算法等。R 还有许多用于处理图像的软件包。
R also gives you a number of options for more advanced data analysis, such as building prediction models, machine learning algorithms, etc. R also has a number of packages for processing images.
Statistics
在数据科学中,高度依赖统计才能存储和翻译数据模式用于预测的高级机器算法。数据科学家利用统计执行数据收集、评估、分析和从数据中推论结论,以及使用相关的量化数学模型和变量。数据科学家担任程序员、研究员和商务主管等职位,所有这些学科都具有统计基础。统计在数据科学中的重要性与编程语言相当。
In data science, advanced machine learning algorithms that stores and translate data patterns for prediction rely heavily on statistics. Data scientists utilize statistics to collect, assess, analyze, and derive conclusions from data, as well as to apply relevant quantitative mathematical models and variables. Data scientists work as programmers, researchers, and executives in business, among other roles, all of these disciplines have a statistical foundation. The importance of statistics in data science is comparable to that of programming languages.
Hadoop
数据科学家在海量数据上执行操作,但有时系统的内存无法处理这些海量数据。那么如何在如此海量的数据上执行数据处理?这里 Hadoop 就发挥了作用。它可用于快速分割数据并将其传输至多个服务器以进行数据处理和其他操作(如筛选)。尽管 Hadoop 基于分布式计算概念,但许多公司要求数据科学家基本了解分布式系统原则(如 Pig、Hive、MapReduce 等)。许多公司已经开始使用 Hadoop 即服务(HaaS),这是云中 Hadoop 的另一个名称,这样数据科学家就不需要了解 Hadoop 的内部工作原理。
Data scientists perform operations on enormous amount of data but sometimes the memory of the system is not able to carry out processing on these huge amount of data. So how data processing will be performed on such huge amount of data? Here Hadoop comes in the picture. It is used to rapidly divide and transfer data to numerous servers for data processing and other actions such as filtering. While Hadoop is based on the concept of Distributed Computing, several firms require that Data Scientists have a fundamental understanding of Distributed System principles such as Pig, Hive, MapReduce, etc. Several firms have begun to use Hadoop-as-a-Service (HaaS), another name for Hadoop in the cloud, so that Data Scientists do not need to understand Hadoop’s inner workings.
Spark
Spark 是一个用于大数据计算的框架,它在数据科学领域中越来越流行。Hadoop 从磁盘读取数据并写入数据,而 Spark 计算结果在系统内存中,使得它与 Hadoop 相比更容易且更快。Apache Spark 的功能是加快复杂算法的速度,其专门用于数据科学。如果数据集很大,它会分布式处理数据,这会节省大量时间。使用 Apache Spark 的主要原因在于其速度和为运行数据科学任务和流程提供的平台。Spark 可以在一台计算机或多个计算机集群上运行,使得使用 Spark 非常方便。
Spark is a framework for big data computation like Hadoop and has gained some popularity in Data Science world. Hadoop reads data from the disk and writes data to the disk while on the other hand Spark Calculates the computation results in the system memory, making it comparatively easy and faster than Hadoop. The function of Apache Spark is to facilitate the speed of the complex algorithms and it is specially designed for the data science. If the dataset is huge then it distributes data processing which saves a lot of time. The main reason of using apache spark is because of its speed and the platform it provides to run data science tasks and processes. It is possible to run Spark on a single machine or a cluster of machines which makes it convenient to work with.
Machine Learning
机器学习是数据科学的关键组成部分。机器学习算法是分析海量数据的有效方式。它可以帮助自动化各种相关数据科学操作。但是,对机器学习原理的深入了解并非在行业内开始职业生涯的必需条件。大多数数据科学家缺乏机器学习技能。只有极少数数据科学家对推荐引擎、对抗性学习、强化学习、自然语言处理、异常值检测、时序分析、计算机视觉、生存分析等高级主题拥有广泛的知识和专门知识。因此,这些能力将帮助你在数据科学职业中脱颖而出。
Machine Learning is crucial component of Data Science. Machine Learning algorithms are an effective method for analysing massive volumes of data. It may assist in automating a variety of Data Science-related operations. Nevertheless, an in-depth understanding of Machine Learning principles is not required to begin a career in this industry. The majority of Data Scientists lack skills in Machine Learning. Just a tiny fraction of Data Scientists has extensive knowledge and expertise in advanced topics such as Recommendation Engines, Adversarial Learning, Reinforcement Learning, Natural Language Processing, Outlier Detection, Time Series Analysis, Computer Vision, Survival Analysis, etc. These competencies will consequently help you stand out in a Data Science profession.
Non-Technical Skills
Understanding of Business Domain
数据科学家对特定业务领域或涉猎领域的了解越深入,就越容易对该特定领域的数据进行分析。
More understanding one has for a particular business area or domain, easier it will be for a data scientist to do the analysis on the data from that particular domain.
Understanding of Data
数据科学涉及所有数据,因此了解数据非常重要,例如什么是数据、如何存储数据、表、行和列的知识。
Data Science is all about data so it is very important to have an understanding of data that what is data, how data is stored, knowledge of tables, rows and columns.
Critical and Logical Thinking
批判性思维是在理解和明确了解思想如何匹配过程中明确、合乎逻辑地思考的能力。在数据科学中,您需要具备批判性思维,以便获得有用的见解并改善业务运营。批判性思维可能是数据科学中最重要的技能之一。它使他们能够更深入地挖掘信息并找出最重要的事情。
Critical thinking is the ability to think clearly and logically while figuring out and understanding how ideas fit together. In data science, you need to be able to think critically to get useful insights and improve business operations. Critical thinking is probably one of the most important skills in data science. It makes it easier for them to dig deeper into information and find the most important things.
Product Understanding
设计模型并非数据科学家的全部工作。数据科学家必须提出可用于提高产品质量的见解。通过系统方法,如果专业人士了解整个产品,他们可以快速加速。他们可以帮助模型启动(引导)并改善特性工程。此技能还可以帮助他们通过揭示以前可能没有想到过的有关产品的想法和见解来改善他们的讲故事能力。
Designing models isn’t the entire job of a data scientist. Data scientists have to come up with insights that can be used to improve the quality of products. With a systematic approach, professionals can accelerate quickly if they understand the whole product. They can help models get started (bootstrap) and improve feature engineering. This skill also helps them improve their storytelling by revealing thoughts and insights about products that they may not have thought of before.
Adaptability
在现代人才获取流程中,数据科学家最抢手的软技能之一是适应能力。由于新技术正在更快地被制造和使用,因此专业人士必须快速学会如何使用它们。作为一名数据科学家,您必须跟上不断变化的业务趋势并能够适应。
One of the most sought-after soft skills for data scientists in the modern talent acquisition process is the ability to adapt. Because new technologies are being made and used more quickly, professionals have to quickly learn how to use them. As a data scientist, you have to keep up with changing business trends and be able to adapt.