Big Data Analytics - Text Analytics

In this chapter, we will be using the data scraped in Part 1 of the book. The data contains text that describes freelancer profiles, together with the hourly rate they charge in USD. The idea of the following section is to fit a model that, given the skills of a freelancer, predicts their hourly rate.

The following code shows how to convert the raw text, which in this case contains the skills of each user, into a bag-of-words matrix. For this we use an R library called tm. This means that for each word in the corpus we create a variable that holds the number of occurrences of that word.

library(tm)
library(data.table)

# Helper functions (bag_words, as_sparseMatrix) from the book's companion code
source('text_analytics/text_analytics_functions.R')

# Load the scraped profiles and keep only rows with a valid hourly rate
data = fread('text_analytics/data/profiles.txt')
rate = as.numeric(data$rate)
keep = !is.na(rate)
rate = rate[keep]

### Make a bag of words of the skills text
X_all = bag_words(data$user_skills[keep])
X_all = removeSparseTerms(X_all, 0.999)
X_all

# <<DocumentTermMatrix (documents: 389, terms: 1422)>>
#   Non-/sparse entries: 4057/549101
# Sparsity           : 99%
# Maximal term length: 80
# Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

### Make a sparse matrix with all the data
X_all <- as_sparseMatrix(X_all)
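
The bag_words and as_sparseMatrix helpers come from the sourced script text_analytics/text_analytics_functions.R and are not listed here. As a rough orientation, the following is a minimal sketch of what such helpers might look like using tm and Matrix; the exact preprocessing (lowercasing, punctuation removal, tf-idf weighting) is an assumption and may differ from the book's implementation.

library(tm)
library(Matrix)

# Sketch of a bag_words helper: build a tf-idf weighted document-term matrix
bag_words <- function(texts) {
   corpus <- VCorpus(VectorSource(texts))
   corpus <- tm_map(corpus, content_transformer(tolower))
   corpus <- tm_map(corpus, removePunctuation)
   DocumentTermMatrix(corpus,
      control = list(weighting = function(x) weightTfIdf(x, normalize = TRUE)))
}

# Sketch of an as_sparseMatrix helper: convert the DocumentTermMatrix
# (a simple triplet matrix) into a dgCMatrix that glmnet can consume
as_sparseMatrix <- function(dtm) {
   sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
      dims = c(dtm$nrow, dtm$ncol), dimnames = dimnames(dtm))
}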

Now that we have the text represented as a sparse matrix, we can fit a model that will give a sparse solution. A good alternative in this case is the LASSO (least absolute shrinkage and selection operator), a regression model that is able to select the most relevant features for predicting the target.

# Use the first 200 profiles for training and the rest for testing
train_inx = 1:200
X_train = X_all[train_inx, ]
y_train = rate[train_inx]
X_test = X_all[-train_inx, ]
y_test = rate[-train_inx]

# Train a LASSO regression model (alpha = 1), tuning lambda by
# 3-fold cross-validation on the mean absolute error
library(glmnet)
fit <- cv.glmnet(x = X_train, y = y_train,
   family = 'gaussian', alpha = 1,
   nfolds = 3, type.measure = 'mae')
plot(fit)

# Make predictions on the test set
predictions = predict(fit, newx = X_test)
predictions = as.vector(predictions[,1])
head(predictions)

# 36.23598 36.43046 51.69786 26.06811 35.13185 37.66367
# We can compute the mean absolute error for the test data
mean(abs(y_test - predictions))
# 15.02175
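
Since the LASSO drives most coefficients exactly to zero, we can also inspect which skills the model selected. The following is a small sketch reusing the fit object from above; lambda.min is the regularization value that minimized the cross-validated error, and the skills that survive will of course depend on the data.

# Coefficients at the lambda that minimized the cross-validated error
coefs = coef(fit, s = 'lambda.min')
w = setNames(as.vector(coefs), rownames(coefs))

# Keep the non-zero entries, excluding the intercept: these are the selected skills
w = w[w != 0 & names(w) != '(Intercept)']
head(w[order(abs(w), decreasing = TRUE)])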

Now we have a model that, given a set of skills, is able to predict the hourly rate of a freelancer. If more data is collected, the performance of the model will improve, but the code that implements this pipeline will remain the same.
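
To score a new profile with this model, its text has to be projected onto the same vocabulary used during training. The helper below is hypothetical (it is not part of the book's code) and uses raw term counts rather than the tf-idf weights of the training matrix, so it should be read as a simplified illustration; it assumes the training vocabulary is available in colnames(X_all).

# Hypothetical helper: align a new skills string with the training vocabulary
# and score it with the fitted model. Terms unseen during training are ignored,
# and raw term counts replace the tf-idf weighting of the training data.
predict_rate <- function(skills_text, fit, vocab) {
   tokens = unlist(strsplit(tolower(skills_text), '[^a-z0-9+#]+'))
   hits = match(tokens[tokens %in% vocab], vocab)
   x = Matrix::sparseMatrix(i = rep(1L, length(hits)), j = hits,
      x = rep(1, length(hits)), dims = c(1, length(vocab)))
   as.numeric(predict(fit, newx = x))
}

predict_rate('machine learning statistics r python', fit, colnames(X_all))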