R - Web Data
Many websites provide data for consumption by their users. For example, the World Health Organization (WHO) publishes reports on health and medical information in the form of CSV, txt and XML files. Using R programs, we can programmatically extract specific data from such websites. Some R packages used to scrape data from the web are "RCurl", "XML", and "stringr". They are used to connect to the URLs, identify the required links to the files, and download them to the local environment.
Install R Packages
The following packages are required for processing the URLs and the links to the files. If they are not available in your R environment, you can install them using the following commands.
install.packages("RCurl")
install.packages("XML")
install.packages("stringr")
install.packages("plyr")
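Installing a package only downloads it; each package must also be loaded into the session before its functions can be used. A minimal sketch, assuming the packages above installed successfully:

```r
# Load the packages into the current R session.
library(RCurl)    # functions for fetching content over HTTP
library(XML)      # getHTMLLinks() for extracting links from a page
library(stringr)  # str_detect() and str_c() for string handling
library(plyr)     # l_ply() for applying a function over a list
```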
Input Data
We will visit the URL weather data and use R to download the CSV files for the year 2015.
Example
We will use the function getHTMLLinks() to gather the URLs of the files. Then we will use the function download.file() to save each file to the local system. As the same code will be applied repeatedly to multiple files, we will create a function that is called once per file. The filenames are passed to this function in the form of an R list object.
# Load the required packages.
library(RCurl)
library(XML)
library(stringr)
library(plyr)

# Read the URL.
url <- "http://www.geos.ed.ac.uk/~weather/jcmb_ws/"

# Gather the HTML links present in the webpage.
links <- getHTMLLinks(url)

# Identify only the links which point to the JCMB 2015 files.
filenames <- links[str_detect(links, "JCMB_2015")]

# Store the file names as a list.
filenames_list <- as.list(filenames)

# Create a function to download a file, given the base URL and the filename.
downloadcsv <- function(mainurl, filename) {
   filedetails <- str_c(mainurl, filename)
   download.file(filedetails, filename)
}

# Apply the function to every filename with l_ply() and save the files
# into the current R working directory.
l_ply(filenames, downloadcsv, mainurl = "http://www.geos.ed.ac.uk/~weather/jcmb_ws/")
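The filtering step can be tried without hitting the network by running str_detect() on a hand-made vector of links. The link names below are hypothetical, not taken from the actual page:

```r
library(stringr)

# Hypothetical vector of links, as getHTMLLinks() might return them.
links <- c("index.html", "JCMB_2015.csv", "JCMB_2015_Apr.csv", "JCMB_2014.csv")

# str_detect() returns a logical vector marking which elements
# contain the pattern; subsetting keeps only the matching links.
filenames <- links[str_detect(links, "JCMB_2015")]
print(filenames)   # keeps only the two 2015 links
```

The base-R equivalent is `links[grepl("JCMB_2015", links)]`; str_detect() is used here for consistency with the stringr-based example above.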