Sas 简明教程

SAS - Merge Data Sets

多个 SAS 数据集可以基于一个特定的公共变量进行合并,以提供单个数据集。这是使用 MERGE 语句和 BY 语句完成的。合并数据集中的观测总数通常小于原始数据集中观测总数的总和。这是因为当公共变量的值匹配时,两个数据集中的变量会作为一个记录合并。

Multiple SAS data sets can be merged based on a specific common variable to give a single data set. This is done using the MERGE statement and BY statement. The total number of observations in the merged data set is often less than the sum of the number of observations in the original data sets. It is because the variables form both data sets get merged as one record based when there is a match in the value of the common variable.

下面给出了合并数据集的两个前提条件:

There are two Prerequisites for merging data sets given below −

  1. input data sets must have at least one common variable to merge on.

  2. input data sets must be sorted by the common variable(s) that will be used to merge on.

Syntax

SAS 中 MERGE 和 BY 语句的基本语法为:

The basic syntax for MERGE and BY statement in SAS is −

MERGE Data-Set 1 Data-Set 2
BY Common Variable

以下是所用参数的描述 -

Following is the description of the parameters used −

  1. Data-set1,Data-set2 are data set names written one after another.

  2. Common Variable is the variable based on whose matching values the data sets will be merged.

Data Merging

让我们借助示例了解数据合并。

Let us understand data merging with the help of an example.

Example

考虑两个 SAS 数据集,一个包含附有姓名和工资的员工 ID,另一个包含附有员工 ID 和部门的员工 ID。在这种情况下,为了获得每个员工的完整信息,我们可以合并这两个数据集。最终数据集仍然对每个员工有一个观测值,但它既包含工资变量,也包含部门变量。

Consider two SAS data sets one containing the employee ID with name and salary and another containing employee ID with employee ID and department. In this case to get the complete information for each employee we can merge these two data sets. The final data set will still have one observation per employee but it will contain both the salary and department variables.

# Data set 1
ID NAME SALARY
1 Rick 623.3
2 Dan 515.2
3 Mike 611.5
4 Ryan 729.1
5 Gary 843.25
6 Tusar 578.6
7 Pranab 632.8
8 Rasmi 722.5

# Data set 2
ID DEPT
1 IT
2 OPS
3 IT
4 HR
5 FIN
6 IT
7 OPS
8 FIN

# Merged data set
ID NAME SALARY DEPT
1 Rick 623.3	IT
2 Dan 515.2 	OPS
3 Mike 611.5 	IT
4 Ryan 729.1    HR
5 Gary 843.25   FIN
6 Tusar 578.6   IT
7 Pranab 632.8  OPS
8 Rasmi 722.5   FIN

可以使用在 BY 语句中使用公共变量 (ID) 的以下代码来实现上述结果。请注意,两个数据集中的观测值已经在 ID 列中进行排序。

The above result is achieved by using the following code in which the common variable (ID) is used in the BY statement. Please note that the observations in both the datasets are already sorted in ID column.

DATA SALARY;
   INPUT empid name $ salary  ;
DATALINES;
1 Rick 623.3
2 Dan 515.2
3 Mike 611.5
4 Ryan 729.1
5 Gary 843.25
6 Tusar 578.6
7 Pranab 632.8
8 Rasmi 722.5
;
RUN;
DATA DEPT;
   INPUT empid dEPT $ ;
DATALINES;
1 IT
2 OPS
3 IT
4 HR
5 FIN
6 IT
7 OPS
8 FIN
;
RUN;
DATA All_details;
MERGE SALARY DEPT;
BY (empid);
RUN;
PROC PRINT DATA = All_details;
RUN;

Missing Values in the Matching Column

在某些情况下,公共变量的某些值在数据集中可能无法匹配。在这种情况下,数据集仍然可以合并,但在结果中会出现缺失值。

There may be cases when some values of the common variable will not match between the data sets. In such cases the data sets still get merged but give missing values in the result.

Example

ID NAME SALARY DEPT
1 Rick 623.3	IT
2 Dan 515.2 	OPS
3 .		.		IT
4 Ryan 729.1    HR
5 Gary 843.25   FIN
6 Tusar 578.6   .
7 Pranab 632.8  OPS
8 Rasmi 722.5   FIN

Merging only the Matches

为了避免结果中的缺失值,我们可以考虑仅保留公共变量值匹配的观测值。可以通过使用 IN 语句实现。需要更改 SAS 程序的合并语句。

To avoid the missing values in the result we can consider keeping only the observations with matched values for the common variable. That is achieved by using the IN statement. The merge statement of the SAS program needs to be changed.

Example

在以下示例中, IN = 值仅保留来自数据集 SALARYDEPT 的值匹配的观测值。

In the below example, the IN= value keeps only the observations where the values from both the data sets SALARY and DEPT match.

DATA All_details;
MERGE SALARY(IN = a) DEPT(IN = b);
BY (empid);
IF a = 1 and b = 1;
RUN;
PROC PRINT DATA = All_details;
RUN;

执行带有上述更改部分的以上 SAS 程序后,会得到以下输出。

Upon execution of the above SAS program with the above changed part, we get the following output.

1 Rick 623.3	IT
2 Dan 515.2 	OPS
4 Ryan 729.1    HR
5 Gary 843.25   FIN
7 Pranab 632.8  OPS
8 Rasmi 722.5   FIN