Big Data Analytics: A Concise Tutorial
Big Data Analytics - Data Analysis Tools
A variety of tools allow a data scientist to analyze data effectively. The engineering side of data analysis normally focuses on databases, while data scientists focus on tools that can implement data products. The following sections discuss the advantages of different tools, with an emphasis on the statistical packages data scientists use most often in practice.
R Programming Language
R is an open source programming language with a focus on statistical analysis. In terms of statistical capabilities, it is competitive with commercial tools such as SAS and SPSS. It is designed to work as an interface to other programming languages such as C, C++, or Fortran.
Another advantage of R is the large number of open source libraries available. CRAN hosts more than 6000 packages that can be downloaded for free, and a wide variety of R packages are also available on GitHub.
In terms of performance, R itself is slow for intensive operations, but in many of the available libraries the slow sections of the code are written in compiled languages. If you intend to do operations that require writing deeply nested for loops, R would not be your best choice. For data analysis, there are good libraries such as data.table, glmnet, ranger, xgboost, ggplot2, and caret that let you use R as an interface to faster programming languages.
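The gap between interpreted loops and compiled routines is not specific to R. As a rough illustration in Python rather than R, the sketch below compares a hand-written loop with the built-in sum, which CPython implements in C. The helper name loop_sum and the input size are assumptions made for this example, and the timings will vary by machine.

```python
import timeit

# Illustrative input; the size is arbitrary.
values = list(range(100_000))


def loop_sum(xs):
    """Add the numbers one by one in interpreted code."""
    total = 0
    for x in xs:
        total += x
    return total


# Time both approaches on the same data.
loop_time = timeit.timeit(lambda: loop_sum(values), number=20)
builtin_time = timeit.timeit(lambda: sum(values), number=20)

print(f"explicit loop: {loop_time:.4f} s")
print(f"built-in sum:  {builtin_time:.4f} s")
```

Both versions return the same result; only the place where the loop runs differs.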
Python for data analysis
Python is a general purpose programming language with a significant number of libraries devoted to data analysis, such as pandas, scikit-learn, theano, numpy, and scipy.
Most of what is available in R can also be done in Python, but we have found that R is simpler to use. If you are working with large datasets, Python is normally a better choice than R. Python can be used quite effectively to clean and process data line by line. This is possible in R as well, but R is not as efficient as Python for scripting tasks.
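As a minimal sketch of line-by-line cleaning, the example below uses only the standard library. The sample data, the column names, and the helper clean_rows are hypothetical and exist only for illustration; a real file would be opened with open() instead of io.StringIO.

```python
import csv
import io

# Hypothetical raw input with stray spaces, a missing age, and a bad age.
RAW = """id,age,city
1, 34 ,Lima
2,,Quito
3,abc,Bogota
4,51, Santiago
"""


def clean_rows(lines):
    """Yield cleaned rows one at a time, skipping rows without a numeric age."""
    reader = csv.DictReader(lines)
    for row in reader:
        age = row["age"].strip()
        if not age.isdigit():
            continue  # drop rows with a missing or non-numeric age
        yield {
            "id": int(row["id"]),
            "age": int(age),
            "city": row["city"].strip(),
        }


# Rows are processed as a stream, so the whole file never has to fit in memory.
cleaned = list(clean_rows(io.StringIO(RAW)))
for row in cleaned:
    print(row)
```

Because clean_rows is a generator, the same function works unchanged on a file that is far larger than the available memory.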
For machine learning, scikit-learn is a good environment that offers a large number of algorithms and handles medium-sized datasets without a problem. Compared to R's equivalent library (caret), scikit-learn has a cleaner and more consistent API.
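One way to see this consistency is that different estimators expose the same fit and score methods. The sketch below assumes scikit-learn is installed; the synthetic dataset and the choice of the two models are assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, used only to have something to fit.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Two unrelated algorithms, driven through the same interface.
models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

accuracies = {}
for model in models:
    model.fit(X_train, y_train)  # same call for every estimator
    accuracies[type(model).__name__] = model.score(X_test, y_test)

for name, accuracy in accuracies.items():
    print(f"{name}: {accuracy:.2f}")
```

Swapping in another classifier only requires changing the list of models; the training and evaluation loop stays the same.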
Julia
Julia is a high-level, high-performance dynamic programming language for technical computing. Its syntax is quite similar to that of R or Python, so if you already work with either of them, writing the same code in Julia should be fairly simple. The language is quite new and has grown significantly in recent years, so it is a realistic option today.
We would recommend Julia for prototyping computationally intensive algorithms such as neural networks. It is a great tool for research. For putting a model into production, Python probably offers better alternatives. However, this is becoming less of a problem, as there are web services that handle the engineering of deploying models written in R, Python, and Julia.
SAS
SAS is a commercial language that is still used for business intelligence. It has a base language that allows the user to program a wide variety of applications. It also includes quite a few commercial products that let non-expert users work with complex tools, such as a neural network library, without needing to program.
Beyond the obvious disadvantages of commercial tools, SAS does not scale well to large datasets. Even medium-sized datasets can cause problems in SAS and make the server crash. SAS is only to be recommended if you are working with small datasets and the users are not expert data scientists. For advanced users, R and Python provide a more productive environment.
SPSS
SPSS is currently an IBM product for statistical analysis. It is mostly used to analyze survey data, and it is a decent alternative for users who cannot program. It is probably as simple to use as SAS, but deploying a model is simpler because SPSS provides SQL code to score the model. This code is normally not efficient, but it is a start, whereas SAS sells its model-scoring product separately for each database. For small data and an inexperienced team, SPSS is as good an option as SAS.
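To make the idea of scoring a model with SQL concrete, the sketch below builds a scoring query for a simple linear model and runs it in an in-memory SQLite database. This is not the code SPSS generates; the coefficients, the customers table, and the helper scoring_sql are all hypothetical.

```python
import sqlite3

# Hypothetical coefficients of an already fitted linear model.
INTERCEPT = 0.5
COEFFICIENTS = {"age": 0.02, "income": 0.0001}


def scoring_sql(table, intercept, coefficients):
    """Build a SELECT that computes the linear score for every row.

    Table and column names are interpolated directly, so they must come
    from trusted code, never from user input.
    """
    terms = " + ".join(
        f"{weight!r} * {column}" for column, weight in coefficients.items()
    )
    return f"SELECT id, {intercept!r} + {terms} AS score FROM {table}"


# Hypothetical table with two customers to score.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, age REAL, income REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 30.0, 20000.0), (2, 50.0, 50000.0)],
)

query = scoring_sql("customers", INTERCEPT, COEFFICIENTS)
scores = dict(conn.execute(query).fetchall())
print(query)
print(scores)
```

Scoring inside the database avoids exporting the data, which is the appeal of generated SQL even when the query itself is not especially efficient.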
The software is, however, rather limited, and experienced users will be orders of magnitude more productive using R or Python.
Matlab, Octave
Other tools are available, such as Matlab or its open source counterpart, Octave. These tools are mostly used for research. In terms of capabilities, R or Python can do everything that is available in Matlab or Octave. Buying a Matlab license only makes sense if you are interested in the support that comes with it.