Sas 简明教程

SAS - Concatenate Data Sets

多个 SAS 数据集可以使用 SET 语句连接,以提供单个数据集。连接的数据集中的观测总数是原始数据集中观测总数之和。观测顺序是顺序的。第一个数据集中的所有观测都紧跟第二个数据集中的所有观测,依此类推。

Multiple SAS data sets can be concatenated to give a single data set using the SET statement. The total number of observations in the concatenated data set is the sum of the number of observations in the original data sets. The order of observations is sequential. All observations from the first data set are followed by all observations from the second data set, and so on.

理想情况下,所有合并的数据集都具有相同的变量,但如果变量数量不同,则结果中将显示所有变量,较小的数据集将出现缺失值。

Ideally all the combining data sets have same variables, but in case they have different number of variables, then in the result all the variables appear, with missing values for the smaller data set.

Syntax

SAS 中 SET 语句的基本语法是 −

The basic syntax for SET statement in SAS is −

SET data-set 1 data-set 2 data-set 3.....;

以下是所用参数的描述 -

Following is the description of the parameters used −

  1. data-set1,data-set2 are dataset names written one after another.

Example

考虑一个组织的员工数据,该数据可在两个不同的数据集中获得,一个用于 IT 部门,另一个用于非 IT 部门。为了获得所有员工的完整详细信息,我们使用如下所示的 SET 语句连接两个数据集。

Consider the employee data of an organization which is available in two different data sets, one for the IT department and another for Non-It department. To get the complete details of all the employees we concatenate both the data sets using the SET statement shown as below.

DATA ITDEPT;
   INPUT empid name $ salary  ;
DATALINES;
1 Rick 623.3
3 Mike 611.5
6 Tusar 578.6
;
RUN;
DATA NON_ITDEPT;
   INPUT empid name $ salary  ;
DATALINES;
2 Dan 515.2
4 Ryan 729.1
5 Gary 843.25
7 Pranab 632.8
8 Rasmi 722.5
RUN;
DATA All_Dept;
   SET ITDEPT NON_ITDEPT;
RUN;
PROC PRINT DATA = All_Dept;
RUN;

当以上代码执行时,我们会得到以下输出:

When the above code is executed, we get the following output.

concateate1

Scenarios

当连接数据集的变体较多时,变量的结果可能不同,但连接的数据集中观测总数始终是每个数据集中观测的总和。我们将在下面考虑许多关于此变体的场景。

When we have many variations in the data sets for concatenation, the result of variables can differ but the total number of observations in the concatenated data set is always the sum of the observations in each data set. We will consider below many scenarios on this variation.

Different number of variables

如果原始数据集之一具有比另一数据集更多的变量,则数据集仍然会被组合,但在较小的数据集中,这些变量显示为缺失。

If one of the original data set has more number of variables then another, then the data sets still get combined but in the smaller data set those variables appear as missing.

Example

在以下示例中,第一个数据集有一个名为 DOJ 的额外变量。在结果中,第二个数据集的 DOJ 值显示为缺失。

In below example the first data set has an extra variable named DOJ. In the result the value of DOJ for second data set will appear as missing.

DATA ITDEPT;
   INPUT empid name $ salary DOJ date9.  ;
DATALINES;
1 Rick 623.3 02APR2001
3 Mike 611.5 21OCT2000
6 Tusar 578.6 01MAR2009
;
RUN;
DATA NON_ITDEPT;
   INPUT empid name $ salary  ;
DATALINES;
2 Dan 515.2
4 Ryan 729.1
5 Gary 843.25
7 Pranab 632.8
8 Rasmi 722.5
RUN;
DATA All_Dept;
   SET ITDEPT NON_ITDEPT;
RUN;
PROC PRINT DATA = All_Dept;
RUN;

当以上代码执行时,我们会得到以下输出:

When the above code is executed, we get the following output.

concateate2

Different variable name

在此场景中,数据集具有相同数量的变量,但变量名在其之间不同。在这种情况下,常规连接将在结果集中生成所有变量,并对不同的两个变量给出缺失结果。虽然我们可能不会更改原始数据集中变量的名称,但我们可以在创建的连接数据集中应用 RENAME 函数。这会产生与常规连接相同的结果,当然,在一个新变量名称取代原始数据集中存在的两个不同变量名称。

In this scenario the data sets have same number of variables but a variable name differs between them. In that case a normal concatenation will produce all the variables in the result set and giving missing results for the two variables which differ. While we may not change the variable name in the original data sets we can apply the RENAME function in the concatenated data set we create. That will produce the same result as a normal concatenation but of course with one new variable name in place of two different variable names present in the original data set.

Example

在以下示例中,数据集 ITDEPT 的变量名称为 ename ,而数据集 NON_ITDEPT *has the variable name *empname. 但这两个变量都表示相同的类型(字符)。我们在 SET 语句中应用 RENAME 函数,如下所示。

In the below example data set ITDEPT has the variable name ename whereas the data set NON_ITDEPT *has the variable name *empname. But both of these variables represent the same type(character). We apply the RENAME function in the SET statement as shown below.

DATA ITDEPT;
   INPUT empid ename $ salary  ;
DATALINES;
1 Rick 623.3
3 Mike 611.5
6 Tusar 578.6
;
RUN;
DATA NON_ITDEPT;
   INPUT empid empname $ salary  ;
DATALINES;
2 Dan 515.2
4 Ryan 729.1
5 Gary 843.25
7 Pranab 632.8
8 Rasmi 722.5
RUN;
DATA All_Dept;
   SET ITDEPT(RENAME =(ename = Employee) ) NON_ITDEPT(RENAME =(empname = Employee) );
RUN;
PROC PRINT DATA = All_Dept;
RUN;

当以上代码执行时,我们会得到以下输出:

When the above code is executed, we get the following output.

concateate3

Different variable lengths

如果两个数据集中变量的长度不同,则连接的数据集中将包含一些数据被截断具有较小长度的变量。如果第一个数据集的长度较小,则会发生这种情况。为了解决此问题,我们对两个数据集都应用了更高的长度,如下所示。

If the variable lengths in the two data sets is different than the concatenated data set will have values in which some data is truncated for the variable with smaller length. It happens if the first data set has a smaller length. To solve this we apply the higher length to both the data set as shown below.

Example

在以下示例中,变量 ename 在第一个数据集中长度为 5,在第二个数据集中长度为 7。在连接时,我们在连接的数据集中应用 LENGTH 语句,将 ename 长度设置为 7。

In the below example the variable ename is of length 5 in the first data set and 7 in the second. When concatenating we apply the LENGTH statement in the concatenated data set to set the ename length to 7.

DATA ITDEPT;
   INPUT  empid 1-2 ename $ 3-7 salary 8-14  ;
DATALINES;
1 Rick  623.3
3 Mike  611.5
6 Tusar 578.6
;
RUN;
DATA NON_ITDEPT;
   INPUT  empid 1-2 ename $ 3-9  salary 10-16 ;
DATALINES;
2 Dan    515.2
4 Ryan   729.1
5 Gary   843.25
7 Pranab 632.8
8 Rasmi  722.5
RUN;
DATA All_Dept;
   LENGTH ename $ 7   ;
   SET ITDEPT  NON_ITDEPT ;
RUN;
PROC PRINT DATA = All_Dept;
RUN;

当以上代码执行时,我们会得到以下输出:

When the above code is executed, we get the following output.

concateate4