Big Data Analytics: A Concise Tutorial
Big Data Analytics - Data Collection
Data collection plays the most important role in the Big Data cycle. The Internet provides almost unlimited sources of data on a wide variety of topics. How much this matters depends on the type of business, but even traditional industries can acquire diverse sources of external data and combine them with their transactional data.
For example, suppose we want to build a system that recommends restaurants. The first step is to gather data, in this case restaurant reviews from different websites, and store them in a database. Because we are interested in the raw text and will use it for analytics, it is not that important where the data used to develop the model is stored. This may sound contradictory to the main big data technologies, but to implement a big data application we simply need to make it work in real time.
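As a minimal sketch of the "gather and store" step, the following base R snippet combines reviews from two sources into one table and writes it to a file. The restaurant names, review texts, source labels, and file name are all hypothetical placeholders, and a CSV file stands in for the database mentioned above.

```r
# Hypothetical reviews standing in for text collected from two websites
site_a <- data.frame(
  source     = "site_a",
  restaurant = c("Luigi's", "Taco Hut"),
  review     = c("Great pasta, slow service.", "Cheap and tasty."),
  stringsAsFactors = FALSE
)
site_b <- data.frame(
  source     = "site_b",
  restaurant = "Luigi's",
  review     = "Nice place for a quiet dinner.",
  stringsAsFactors = FALSE
)

# Stack the raw text from both sources into a single table
reviews <- rbind(site_a, site_b)

# Store the raw reviews; a CSV in the temp directory stands in for a database
store_path <- file.path(tempdir(), "restaurant_reviews.csv")
write.csv(reviews, store_path, row.names = FALSE)

# Read the stored data back for later analysis
stored <- read.csv(store_path, stringsAsFactors = FALSE)
print(stored)
```

Keeping the source label next to each review makes it possible to trace any record back to where it was collected.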
Twitter Mini Project
Once the problem is defined, the next stage is to collect the data. The idea of the following mini project is to collect data from the web and structure it so that it can be used in a machine learning model. We will collect some tweets from the Twitter REST API using the R programming language.
First, create a Twitter account, and then follow the instructions in the twitteR package vignette to create a Twitter developer account. The following is a summary of those instructions:
- Go to https://twitter.com/apps/new and log in.
- After filling in the basic info, go to the "Settings" tab and select "Read, Write and Access direct messages".
- Make sure to click on the save button after doing this.
- In the "Details" tab, take note of your consumer key and consumer secret.
- In your R session, you’ll be using the API key and API secret values.
- Finally, run the following script. This will install the twitteR package from its repository on GitHub.
install.packages(c("devtools", "rjson", "bit64", "httr"))
# Make sure to restart your R session at this point
library(devtools)
install_github("geoffjentry/twitteR")
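Since the R session needs the key and secret values noted in the steps above, one option is to read them from environment variables instead of typing them into a script. The sketch below assumes this approach. The helper `read_twitter_credentials` and the four `TWITTER_*` variable names are hypothetical and are not part of the twitteR package.

```r
# Hypothetical helper: read the four credentials from environment variables.
# The variable names below are assumptions; choose any names you like.
read_twitter_credentials <- function() {
  env_names <- c(
    consumer_key        = "TWITTER_CONSUMER_KEY",
    consumer_secret     = "TWITTER_CONSUMER_SECRET",
    access_token        = "TWITTER_ACCESS_TOKEN",
    access_token_secret = "TWITTER_ACCESS_TOKEN_SECRET"
  )
  values <- Sys.getenv(env_names, unset = "", names = FALSE)
  names(values) <- names(env_names)

  # Fail early with a clear message if any credential is not set
  missing <- env_names[!nzchar(values)]
  if (length(missing) > 0) {
    stop("Missing environment variables: ", paste(missing, collapse = ", "))
  }
  as.list(values)
}

# Usage (requires the variables to be set, e.g. in ~/.Renviron):
# creds <- read_twitter_credentials()
# setup_twitter_oauth(creds$consumer_key, creds$consumer_secret,
#                     creds$access_token, creds$access_token_secret)
```

This keeps the secrets out of any script that might be shared or committed to version control.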
We are interested in tweets that contain the string "big mac" and in finding out which topics stand out around it. To do this, the first step is to collect the data from Twitter. Below is the R script that collects the required data. This code is also available in the bda/part1/collect_data/collect_data_twitter.R file.
rm(list = ls(all = TRUE)); gc() # Clears the global environment
library(twitteR)
Sys.setlocale(category = "LC_ALL", locale = "C")
### Replace the xxx's with the values you got from the previous instructions
consumer_key <- "xxxxxxxxxxxxxxxxxxxx"
consumer_secret <- "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
access_token <- "xxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
access_token_secret <- "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# Connect to twitter rest API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_token_secret)
# Get tweets related to big mac
tweets <- searchTwitter('big mac', n = 200, lang = 'en')
df <- twListToDF(tweets)
# Take a look at the data
head(df)
# Check which device is most used
sources <- sapply(tweets, function(x) x$getStatusSource())
sources <- gsub("</a>", "", sources)
sources <- strsplit(sources, ">")
sources <- sapply(sources, function(x) ifelse(length(x) > 1, x[2], x[1]))
source_table <- table(sources)
source_table <- source_table[source_table > 1]
freq <- source_table[order(source_table, decreasing = TRUE)]
as.data.frame(freq)
# Frequency
# Twitter for iPhone 71
# Twitter for Android 29
# Twitter Web Client 25
# recognia 20
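The device-counting step in the script above can be tried without API access. The sketch below applies the same `gsub`/`strsplit` logic to a few hypothetical source strings. It assumes the status source arrives as an HTML anchor, which is what the original parsing implies. The sample values and the `extract_client` helper are illustrative only.

```r
# Hypothetical source strings in the HTML-anchor format the script assumes
sample_sources <- c(
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
  'web'
)

# Hypothetical helper: keep only the client name from each source string
extract_client <- function(sources) {
  stripped <- gsub("</a>", "", sources, fixed = TRUE)   # drop the closing tag
  parts <- strsplit(stripped, ">", fixed = TRUE)        # split tag from label
  # Use the label after the opening tag if present, else the raw string
  vapply(parts, function(x) if (length(x) > 1) x[2] else x[1], character(1))
}

clients <- extract_client(sample_sources)

# Count how often each client appears, most frequent first
client_counts <- sort(table(clients), decreasing = TRUE)
print(as.data.frame(client_counts, stringsAsFactors = FALSE))
```

Using `vapply` with `character(1)` instead of `sapply` with `ifelse` guarantees a character vector even when the input is empty.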