Big Data Analytics: A Concise Tutorial

Big Data Analytics - Cleansing Data

Once the data is collected, we normally have diverse data sources with different characteristics. The most immediate step is to make these data sources homogeneous and continue developing our data product. Whether that makes sense depends on the type of data, so we should first ask ourselves whether homogenizing the data is practical.

The data sources may be completely different, and homogenizing them could cause a large loss of information. In that case, we can think of alternatives. Can one data source help us build a regression model and the other a classification model? Is it possible to work with the heterogeneity to our advantage rather than just lose information? Making these decisions is what makes analytics interesting and challenging.

In the case of reviews, each data source may be written in a different language. Again, we have two choices:

Homogenization − It involves translating the different languages into the language for which we have the most data. The quality of translation services is acceptable, but translating massive amounts of data through an API would be costly. Software tools are available for this task, but they would be costly too.
Heterogenization − Would it be possible to develop a solution for each language? As it is simple to detect the language of a corpus, we could develop a recommender for each language. This would involve more work, because each recommender has to be tuned separately and the effort grows with the number of languages, but it is definitely a viable option if we have only a few languages.
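As a rough illustration of the heterogenization option, the sketch below routes reviews into per-language groups using a naive stopword-overlap score. The stopword lists, the sample reviews, and the guess_language helper are all hypothetical and are not part of the original project; a real system would more likely rely on a dedicated language-detection library.

```r
# Hypothetical, deliberately tiny stopword lists (assumption, for illustration only)
stopwords_by_lang <- list(
  en = c("the", "and", "is", "was", "this", "it", "of", "very"),
  es = c("el", "la", "y", "es", "muy", "de", "que", "los"),
  fr = c("le", "la", "et", "est", "une", "de", "les", "pas")
)

# Guess the language of one text by counting stopword hits per language
guess_language <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  scores <- vapply(stopwords_by_lang,
                   function(sw) sum(tokens %in% sw),
                   numeric(1))
  if (max(scores) == 0) return(NA_character_)  # no evidence for any language
  names(scores)[which.max(scores)]
}

# Hypothetical reviews standing in for the different data sources
reviews <- c(
  "The food was great and the service is fast",
  "La comida es muy buena y el servicio es rapido",
  "Le repas est bon et les prix ne sont pas chers"
)

langs <- vapply(reviews, guess_language, character(1), USE.NAMES = FALSE)

# One group of reviews per detected language
reviews_by_lang <- split(reviews, langs)
```

Each element of reviews_by_lang could then feed its own recommender, which is where the per-language tuning effort mentioned above comes in.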

Twitter Mini Project

In the present case, we first need to clean the unstructured data and then convert it to a data matrix in order to apply topic modeling to it. In general, when getting data from Twitter, there are several characters we are not interested in using, at least in the first stage of the data cleansing process.

For example, after getting the tweets we see strange characters such as "<ed><U+00A0><U+00BD><ed><U+00B8><U+008B>". These are probably emoticons whose bytes could not be displayed, so in order to clean the data, we will just remove them using the following script. This code is also available in the bda/part1/collect_data/cleaning_data.R file.

rm(list = ls(all.names = TRUE)); gc() # Clears the global environment
source('collect_data_twitter.R') # Expected to create the data frame df used below
# Some tweets
head(df$text)
# [1] "I’m not a big fan of turkey but baked Mac & cheese <ed><U+00A0><U+00BD><ed><U+00B8><U+008B>"
# [2] "@Jayoh30 Like no special sauce on a big mac. HOW"

### We are interested in the text - Let’s clean it!

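# We first convert the encoding of the text from latin1 to ASCII.
# sub = "" drops every byte that has no ASCII equivalent; iconv is vectorized,
# so it can be applied to the whole column at once.
df$text <- iconv(df$text, from = "latin1", to = "ASCII", sub = "")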

# Create a function to clean tweets (gsub is vectorized over tx)
clean.text <- function(tx) {
  tx <- gsub("http\\S+", " ", tx)                     # remove URLs
  tx <- gsub("\\bRT\\b", " ", tx, perl = TRUE)        # remove the retweet marker; case-sensitive so words containing "rt" survive
  tx <- gsub("[^#[:^punct:]]", " ", tx, perl = TRUE)  # remove punctuation except '#'; this also removes '@'
  tx <- gsub("[[:digit:]]", " ", tx)                  # remove digits
  tx <- gsub("\\s+", " ", tx)                         # collapse runs of whitespace
  tx <- gsub("^\\s+|\\s+$", "", tx)                   # trim leading and trailing whitespace
  return(tx)
}

# Calling the function on the whole column returns a character vector
clean_tweets <- clean.text(df$text)

# Cleaned tweets
head(clean_tweets)
# Excerpt of the cleaned output:
# "WeNeedFeminlsm MAC s new make up line features men woc and big girls"
# "TravelsPhoto What Happens To Your Body One Hour After A Big Mac"
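The encoding conversion that the script runs before clean.text can be checked in isolation. The following is a minimal sketch, assuming tweets that arrive as UTF-8 text containing an emoji and a curly apostrophe; the sample strings are made up for illustration. Treating the bytes as latin1 and converting to ASCII with sub = "" should leave only 7-bit characters behind.

```r
# Hypothetical tweets: "\U0001F60B" is an emoji, "\u2019" a curly apostrophe
tweets <- c("Baked Mac & cheese \U0001F60B",
            "I\u2019m lovin it",
            "plain ascii")

# Bytes with no ASCII equivalent are replaced by sub = "", i.e. dropped
ascii_only <- iconv(tweets, from = "latin1", to = "ASCII", sub = "")

# Collect the byte values that remain after the conversion
remaining_bytes <- unlist(lapply(ascii_only,
                                 function(s) as.integer(charToRaw(s))))

# TRUE when nothing outside the ASCII range is left
all_ascii <- all(remaining_bytes < 128)
print(all_ascii)
```

Note that this approach discards accented letters as well as emoji, which may or may not be acceptable depending on the languages present in the tweets.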

The final step of the data cleansing mini project is to have cleaned text that we can convert to a matrix and apply an algorithm to. The text stored in the clean_tweets vector can easily be converted to a bag-of-words matrix, to which we can then apply an unsupervised learning algorithm.
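The conversion to a bag-of-words matrix can be sketched in base R as follows. The clean_tweets values here are hypothetical stand-ins for the real cleaned tweets, and hierarchical clustering is used only as one example of an unsupervised method; it is a substitute chosen for brevity, not the topic model the project applies later.

```r
# Hypothetical cleaned tweets standing in for the real clean_tweets vector
clean_tweets <- c(
  "big mac with no special sauce",
  "what happens to your body after a big mac",
  "mac and cheese is not turkey"
)

# Tokenize on whitespace after lower-casing
tokens <- strsplit(tolower(clean_tweets), "\\s+")

# Vocabulary: the unique terms across all tweets
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in one tokenized tweet
count_terms <- function(tk) as.integer(table(factor(tk, levels = vocab)))

# Bag-of-words matrix: one row per tweet, one column per term
bow <- t(vapply(tokens, count_terms, integer(length(vocab))))
colnames(bow) <- vocab

# Example unsupervised step (assumption): cluster tweets by term counts
clusters <- cutree(hclust(dist(bow)), k = 2)
```

With real data the matrix is typically very sparse, so a sparse representation from a text-mining package would usually be preferable to a dense base R matrix.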