FlatFileItemReader
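- Resource: A Spring Core Resource that identifies the path of the file to be processed.
- LineMapper: Converts a String line into an object, which lets flat file data be tokenized into a FieldSet and mapped to a domain object.
The LineMapper interface covers two basic tasks: tokenizing a line into a FieldSet and mapping that FieldSet to a domain object. Spring Batch provides the LineTokenizer (tokenizes a line) and FieldSetMapper (maps a FieldSet) abstractions for these tasks.
By default, DefaultLineMapper uses a LineTokenizer and a FieldSetMapper to convert a line. For simple delimited files, the prebuilt DelimitedLineTokenizer can be paired with a FieldSetMapper such as the PlayerFieldSetMapper example shown later.
For fixed-length file formats, FixedLengthTokenizer provides tokenization, and BeanWrapperFieldSetMapper can map fields automatically according to the JavaBean specification.
To handle files whose records have different formats, a PatternMatchingCompositeLineMapper can be configured to map lines to LineTokenizers and FieldSetMappers by pattern matching.
Spring Batch also provides exception handling for problems encountered while tokenizing a line. The exceptions include FlatFileParseException and FlatFileFormatException, which carry details about the cause of the error.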
A flat file is any type of file that contains at most two-dimensional (tabular) data. In the Spring Batch framework, flat files are read with the FlatFileItemReader class, which provides basic functionality for reading and parsing them. The two most important required dependencies of FlatFileItemReader are Resource and LineMapper. The LineMapper interface is explored in the following sections. The resource property represents a Spring Core Resource. Documentation explaining how to create beans of this type can be found in Spring Framework, Chapter 5. Resources. Therefore, this guide does not go into the details of creating Resource objects beyond showing the following simple example:
Resource resource = new FileSystemResource("resources/trades.csv");
In complex batch environments, the directory structures are often managed by the Enterprise Application Integration (EAI) infrastructure, where drop zones for external interfaces are established for moving files from FTP locations to batch processing locations and vice versa. File moving utilities are beyond the scope of the Spring Batch architecture, but it is not unusual for batch job streams to include file moving utilities as steps in the job stream. The batch architecture only needs to know how to locate the files to be processed. Spring Batch begins the process of feeding the data into the pipe from this starting point. However, Spring Integration provides many of these types of services.
The other properties in FlatFileItemReader let you further specify how your data is interpreted, as described in the following table:
Property | Type | Description |
---|---|---|
comments | String[] | Specifies line prefixes that indicate comment rows. |
encoding | String | Specifies what text encoding to use. The default value is UTF-8. |
lineMapper | LineMapper | Converts a String to an Object representing the item. |
linesToSkip | int | Number of lines to ignore at the top of the file. |
recordSeparatorPolicy | RecordSeparatorPolicy | Used to determine where the line endings are and do things like continue over a line ending if inside a quoted string. |
resource | Resource | The resource from which to read. |
skippedLinesCallback | LineCallbackHandler | Interface that passes the raw line content of the lines in the file to be skipped. If linesToSkip is set to 2, then this interface is called twice. |
strict | boolean | In strict mode, the reader throws an exception on ExecutionContext if the input resource does not exist. Otherwise, it logs the problem and continues. |
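To make the linesToSkip and comments properties concrete, the following plain-Java sketch shows the kind of filtering they describe. It is an illustration only, not the FlatFileItemReader implementation: the class name SkipAndCommentSketch and the method selectRecordLines are hypothetical, and the sketch works on an in-memory list rather than a Resource.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of linesToSkip and comments; not Spring Batch code. */
final class SkipAndCommentSketch {

    /** Drops the first linesToSkip lines, then any line that starts with a comment prefix. */
    static List<String> selectRecordLines(List<String> lines, int linesToSkip, String... commentPrefixes) {
        List<String> records = new ArrayList<>();
        for (int i = Math.max(0, linesToSkip); i < lines.size(); i++) {
            String line = lines.get(i);
            if (!isComment(line, commentPrefixes)) {
                records.add(line);
            }
        }
        return records;
    }

    private static boolean isComment(String line, String[] prefixes) {
        for (String prefix : prefixes) {
            if (line.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

With one skipped header line and "#" as the comment prefix, only the data rows remain for tokenizing and mapping.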
LineMapper
As with RowMapper, which takes a low-level construct such as ResultSet and returns an Object, flat file processing requires the same construct to convert a String line into an Object, as shown in the following interface definition:
public interface LineMapper<T> {

    T mapLine(String line, int lineNumber) throws Exception;

}
The basic contract is that, given the current line and the line number with which it is associated, the mapper should return a resulting domain object. This is similar to RowMapper, in that each line is associated with its line number, just as each row in a ResultSet is tied to its row number. This allows the line number to be tied to the resulting domain object for identity comparison or for more informative logging. However, unlike RowMapper, the LineMapper is given a raw line which, as discussed above, only gets you halfway there. The line must be tokenized into a FieldSet, which can then be mapped to an object, as described later in this document.
LineTokenizer
An abstraction for turning a line of input into a FieldSet is necessary because there can be many formats of flat file data that need to be converted to a FieldSet. In Spring Batch, this interface is the LineTokenizer:
public interface LineTokenizer {

    FieldSet tokenize(String line);

}
The contract of a LineTokenizer is such that, given a line of input (in theory the String could encompass more than one line), a FieldSet representing the line is returned. This FieldSet can then be passed to a FieldSetMapper. Spring Batch contains the following LineTokenizer implementations:
- DelimitedLineTokenizer: Used for files where fields in a record are separated by a delimiter. The most common delimiter is a comma, but pipes or semicolons are often used as well.
- FixedLengthTokenizer: Used for files where fields in a record are each a "fixed width". The width of each field must be defined for each record type.
- PatternMatchingCompositeLineTokenizer: Determines which LineTokenizer among a list of tokenizers should be used on a particular line by checking against a pattern.
FieldSetMapper
The FieldSetMapper interface defines a single method, mapFieldSet, which takes a FieldSet object and maps its contents to an object. This object may be a custom DTO, a domain object, or an array, depending on the needs of the job. The FieldSetMapper is used in conjunction with the LineTokenizer to translate a line of data from a resource into an object of the desired type, as shown in the following interface definition:
public interface FieldSetMapper<T> {

    T mapFieldSet(FieldSet fieldSet) throws BindException;

}
The pattern used is the same as the RowMapper used by JdbcTemplate.
DefaultLineMapper
Now that the basic interfaces for reading in flat files have been defined, it becomes clear that three basic steps are required:
1. Read one line from the file.
2. Pass the String line into the LineTokenizer#tokenize() method to retrieve a FieldSet.
3. Pass the FieldSet returned from tokenizing to a FieldSetMapper, returning the result from the ItemReader#read() method.
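The three steps above can be sketched in plain Java without any framework classes. This is only an analogy under stated assumptions: the class ThreeStepReadSketch and its readAll method are hypothetical, java.util.function.Function stands in for the tokenizer and mapper roles, and a String array stands in for a FieldSet.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Plain-Java analogy of read, tokenize, map; not the Spring Batch implementation. */
final class ThreeStepReadSketch {

    static <T> List<T> readAll(String content,
                               Function<String, String[]> tokenizer,
                               Function<String[], T> mapper) {
        List<T> items = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(content))) {
            String line;
            while ((line = reader.readLine()) != null) {   // step 1: read one line
                String[] fields = tokenizer.apply(line);    // step 2: tokenize the line
                items.add(mapper.apply(fields));            // step 3: map the fields to an item
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return items;
    }
}
```

For example, readAll(content, line -> line.split(","), fields -> fields[2] + " " + fields[1]) turns each comma-separated row into a "firstName lastName" string.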
The two interfaces described above represent two separate tasks: converting a line into a FieldSet and mapping a FieldSet to a domain object. Because the input of a LineTokenizer matches the input of the LineMapper (a line), and the output of a FieldSetMapper matches the output of the LineMapper, a default implementation that uses both a LineTokenizer and a FieldSetMapper is provided. The DefaultLineMapper, shown in the following class definition, represents the behavior most users need:
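public class DefaultLineMapper<T> implements LineMapper<T>, InitializingBean {

    private LineTokenizer tokenizer;

    private FieldSetMapper<T> fieldSetMapper;

    // Tokenize the raw line, then map the resulting FieldSet to an item.
    public T mapLine(String line, int lineNumber) throws Exception {
        return fieldSetMapper.mapFieldSet(tokenizer.tokenize(line));
    }

    public void setLineTokenizer(LineTokenizer tokenizer) {
        this.tokenizer = tokenizer;
    }

    public void setFieldSetMapper(FieldSetMapper<T> fieldSetMapper) {
        this.fieldSetMapper = fieldSetMapper;
    }

    // Required by InitializingBean: both collaborators must be set.
    public void afterPropertiesSet() {
        Assert.notNull(tokenizer, "The LineTokenizer must be set");
        Assert.notNull(fieldSetMapper, "The FieldSetMapper must be set");
    }

}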
The above functionality is provided in a default implementation, rather than being built into the reader itself (as was done in previous versions of the framework) to allow users greater flexibility in controlling the parsing process, especially if access to the raw line is needed.
Simple Delimited File Reading Example
The following example illustrates how to read a flat file with an actual domain scenario. This particular batch job reads in football players from the following file:
ID,lastName,firstName,position,birthYear,debutYear
"AbduKa00,Abdul-Jabbar,Karim,rb,1974,1996",
"AbduRa00,Abdullah,Rabih,rb,1975,1999",
"AberWa00,Abercrombie,Walter,rb,1959,1982",
"AbraDa00,Abramowicz,Danny,wr,1945,1967",
"AdamBo00,Adams,Bob,te,1946,1969",
"AdamCh00,Adams,Charlie,wr,1979,2003"
The contents of this file are mapped to the following Player domain object:
public class Player implements Serializable {

    private String ID;
    private String lastName;
    private String firstName;
    private String position;
    private int birthYear;
    private int debutYear;

    public String toString() {
        return "PLAYER:ID=" + ID + ",Last Name=" + lastName +
            ",First Name=" + firstName + ",Position=" + position +
            ",Birth Year=" + birthYear + ",DebutYear=" +
            debutYear;
    }

    // setters and getters...
}
To map a FieldSet into a Player object, a FieldSetMapper that returns players needs to be defined, as shown in the following example:
protected static class PlayerFieldSetMapper implements FieldSetMapper<Player> {
    public Player mapFieldSet(FieldSet fieldSet) {
        Player player = new Player();

        // Fields are read by their zero-based position in the line.
        player.setID(fieldSet.readString(0));
        player.setLastName(fieldSet.readString(1));
        player.setFirstName(fieldSet.readString(2));
        player.setPosition(fieldSet.readString(3));
        player.setBirthYear(fieldSet.readInt(4));
        player.setDebutYear(fieldSet.readInt(5));

        return player;
    }
}
The file can then be read by correctly constructing a FlatFileItemReader and calling read, as shown in the following example:
FlatFileItemReader<Player> itemReader = new FlatFileItemReader<>();
itemReader.setResource(new FileSystemResource("resources/players.csv"));
DefaultLineMapper<Player> lineMapper = new DefaultLineMapper<>();
//DelimitedLineTokenizer defaults to comma as its delimiter
lineMapper.setLineTokenizer(new DelimitedLineTokenizer());
lineMapper.setFieldSetMapper(new PlayerFieldSetMapper());
itemReader.setLineMapper(lineMapper);
itemReader.open(new ExecutionContext());
Player player = itemReader.read();
Each call to read returns a new Player object for the next line in the file. When the end of the file is reached, null is returned.
Mapping Fields by Name
Both DelimitedLineTokenizer and FixedLengthTokenizer offer one additional piece of functionality that is similar to a JDBC ResultSet. The names of the fields can be injected into either of these LineTokenizer implementations to increase the readability of the mapping function. First, the column names of all fields in the flat file are injected into the tokenizer, as shown in the following example:
tokenizer.setNames(new String[] {"ID", "lastName", "firstName", "position", "birthYear", "debutYear"});
A FieldSetMapper can use this information as follows:
public class PlayerMapper implements FieldSetMapper<Player> {
    public Player mapFieldSet(FieldSet fs) {

        if (fs == null) {
            return null;
        }

        // Fields are read by the names injected into the tokenizer.
        Player player = new Player();
        player.setID(fs.readString("ID"));
        player.setLastName(fs.readString("lastName"));
        player.setFirstName(fs.readString("firstName"));
        player.setPosition(fs.readString("position"));
        player.setDebutYear(fs.readInt("debutYear"));
        player.setBirthYear(fs.readInt("birthYear"));

        return player;
    }
}
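As a rough picture of what mapping by name involves, the following plain-Java sketch pairs each configured name with the token at the same position. It is a simplification under stated assumptions, not how DelimitedLineTokenizer or FieldSet is implemented: the class NamedFieldsSketch is hypothetical, a Map stands in for a FieldSet, and quoted fields are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java illustration of name-based field access; not Spring Batch code. */
final class NamedFieldsSketch {

    /** Splits a comma-delimited line and keys each token by the name at the same position. */
    static Map<String, String> tokenize(String line, String... names) {
        String[] tokens = line.split(",", -1); // -1 keeps trailing empty fields
        if (tokens.length != names.length) {
            throw new IllegalArgumentException(
                    "Expected " + names.length + " tokens but found " + tokens.length);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            fields.put(names[i], tokens[i].trim());
        }
        return fields;
    }
}
```

A mapper can then ask for fields.get("lastName") instead of remembering that the last name is the second column, which keeps the mapping readable when columns are added or reordered.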
Automapping FieldSets to Domain Objects
For many, having to write a specific FieldSetMapper is as cumbersome as writing a specific RowMapper for a JdbcTemplate. Spring Batch makes this easier by providing a FieldSetMapper that automatically maps fields by matching a field name with a setter on the object using the JavaBean specification.
Again using the football example, the BeanWrapperFieldSetMapper configuration looks like the following snippet in Java:
@Bean
public FieldSetMapper<Player> fieldSetMapper() {
    BeanWrapperFieldSetMapper<Player> fieldSetMapper = new BeanWrapperFieldSetMapper<>();

    // A new "player" prototype bean is created for each mapped FieldSet.
    fieldSetMapper.setPrototypeBeanName("player");

    return fieldSetMapper;
}

@Bean
@Scope("prototype")
public Player player() {
    return new Player();
}
Again using the football example, the BeanWrapperFieldSetMapper configuration looks like the following snippet in XML:
<bean id="fieldSetMapper"
class="org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper">
<property name="prototypeBeanName" value="player" />
</bean>
<bean id="player"
class="org.springframework.batch.samples.domain.Player"
scope="prototype" />
For each entry in the FieldSet, the mapper looks for a corresponding setter on a new instance of the Player object (for this reason, prototype scope is required) in the same way the Spring container looks for setters matching a property name. Each available field in the FieldSet is mapped, and the resultant Player object is returned, with no code required.
Fixed Length File Formats
So far, only delimited files have been discussed in much detail. However, they represent only half of the file reading picture. Many organizations that use flat files use fixed length formats. An example fixed length file follows:
UK21341EAH4121131.11customer1
UK21341EAH4221232.11customer2
UK21341EAH4321333.11customer3
UK21341EAH4421434.11customer4
UK21341EAH4521535.11customer5
While this looks like one large field, it actually represents four distinct fields:
- ISIN: Unique identifier for the item being ordered - 12 characters long.
- Quantity: Number of the item being ordered - 3 characters long.
- Price: Price of the item - 5 characters long.
- Customer: ID of the customer ordering the item - 9 characters long.
When configuring the FixedLengthTokenizer, each of these lengths must be provided in the form of ranges.
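To see how the four lengths translate into ranges, the following plain-Java sketch slices the first sample line using 1-based, inclusive start and end positions (1-12, 13-15, 16-20, 21-29). It only illustrates the arithmetic and is not the FixedLengthTokenizer implementation; the class FixedWidthSketch and the int-pair representation of a range are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of slicing a fixed-width line; not Spring Batch code. */
final class FixedWidthSketch {

    /** Each range is {start, end}, 1-based and inclusive. */
    static List<String> slice(String line, int[][] ranges) {
        List<String> fields = new ArrayList<>();
        for (int[] range : ranges) {
            // 1-based inclusive [start, end] becomes 0-based half-open [start - 1, end)
            fields.add(line.substring(range[0] - 1, range[1]));
        }
        return fields;
    }
}
```

Each range end minus its start, plus one, equals the field width, so 12 + 3 + 5 + 9 accounts for all 29 characters of a record.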
The following example shows how to define ranges for the FixedLengthTokenizer in Java:
@Bean
public FixedLengthTokenizer fixedLengthTokenizer() {
    FixedLengthTokenizer tokenizer = new FixedLengthTokenizer();

    tokenizer.setNames("ISIN", "Quantity", "Price", "Customer");
    tokenizer.setColumns(new Range(1, 12),
                         new Range(13, 15),
                         new Range(16, 20),
                         new Range(21, 29));

    return tokenizer;
}
The following example shows how to define ranges for the FixedLengthTokenizer in XML:
<bean id="fixedLengthLineTokenizer"
class="org.springframework.batch.item.file.transform.FixedLengthTokenizer">
<property name="names" value="ISIN,Quantity,Price,Customer" />
<property name="columns" value="1-12, 13-15, 16-20, 21-29" />
</bean>
Because the FixedLengthTokenizer uses the same LineTokenizer interface as discussed earlier, it returns the same FieldSet as if a delimiter had been used. This allows the same approaches to be used in handling its output, such as using the BeanWrapperFieldSetMapper.
Supporting the preceding syntax for ranges requires that a specialized property editor, RangeArrayPropertyEditor, be configured in the ApplicationContext. However, this bean is automatically declared in an ApplicationContext where the batch namespace is used.
Multiple Record Types within a Single File
All of the file reading examples up to this point have made a key assumption for simplicity's sake: all of the records in a file have the same format. However, this may not always be the case. It is very common for a file to have records with different formats that need to be tokenized differently and mapped to different objects. The following excerpt from a file illustrates this:
USER;Smith;Peter;;T;20014539;F
LINEA;1044391041ABC037.49G201XX1383.12H
LINEB;2134776319DEF422.99M005LI
In this file we have three types of records, "USER", "LINEA", and "LINEB". A "USER" line corresponds to a User object. "LINEA" and "LINEB" both correspond to Line objects, though a "LINEA" has more information than a "LINEB".
The ItemReader reads each line individually, but we must specify different LineTokenizer and FieldSetMapper objects so that the ItemWriter receives the correct items. The PatternMatchingCompositeLineMapper makes this easy by allowing maps of patterns to LineTokenizers and patterns to FieldSetMappers to be configured.
The following example shows how to configure the PatternMatchingCompositeLineMapper in Java:
@Bean
public PatternMatchingCompositeLineMapper orderFileLineMapper() {
    PatternMatchingCompositeLineMapper lineMapper =
        new PatternMatchingCompositeLineMapper();

    // One tokenizer per record type.
    Map<String, LineTokenizer> tokenizers = new HashMap<>(3);
    tokenizers.put("USER*", userTokenizer());
    tokenizers.put("LINEA*", lineATokenizer());
    tokenizers.put("LINEB*", lineBTokenizer());

    lineMapper.setTokenizers(tokenizers);

    // Both LINEA and LINEB records share the "LINE*" mapper.
    Map<String, FieldSetMapper> mappers = new HashMap<>(2);
    mappers.put("USER*", userFieldSetMapper());
    mappers.put("LINE*", lineFieldSetMapper());

    lineMapper.setFieldSetMappers(mappers);

    return lineMapper;
}
The following example shows how to configure the PatternMatchingCompositeLineMapper in XML:
<bean id="orderFileLineMapper"
class="org.spr...PatternMatchingCompositeLineMapper">
<property name="tokenizers">
<map>
<entry key="USER*" value-ref="userTokenizer" />
<entry key="LINEA*" value-ref="lineATokenizer" />
<entry key="LINEB*" value-ref="lineBTokenizer" />
</map>
</property>
<property name="fieldSetMappers">
<map>
<entry key="USER*" value-ref="userFieldSetMapper" />
<entry key="LINE*" value-ref="lineFieldSetMapper" />
</map>
</property>
</bean>
In this example, "LINEA" and "LINEB" have separate LineTokenizer instances, but they both use the same FieldSetMapper.
The PatternMatchingCompositeLineMapper uses the PatternMatcher#match method to select the correct delegate for each line. The PatternMatcher allows for two wildcard characters with special meaning: the question mark ("?") matches exactly one character, while the asterisk ("*") matches zero or more characters. Note that, in the preceding configuration, all patterns end with an asterisk, making them effectively prefixes to lines. The PatternMatcher always matches the most specific pattern possible, regardless of the order in the configuration. So if "LINE*" and "LINEA*" were both listed as patterns, "LINEA" would match pattern "LINEA*", while "LINEB" would match pattern "LINE*". Additionally, a single asterisk ("*") can serve as a default by matching any line not matched by any other pattern.
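The wildcard rules above can be illustrated with a small plain-Java sketch. It is not the PatternMatcher implementation: the class WildcardSketch is hypothetical, the wildcards are translated to a regular expression for brevity, and "most specific" is approximated, as an assumption of this sketch, by choosing the matching pattern with the most literal (non-wildcard) characters.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Plain-Java illustration of '?' and '*' matching; not the Spring Batch PatternMatcher. */
final class WildcardSketch {

    /** '?' matches exactly one character, '*' matches zero or more characters. */
    static boolean matches(String pattern, String line) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(line).matches();
    }

    /** Returns the matching pattern with the most literal characters, or null if none match. */
    static String mostSpecific(List<String> patterns, String line) {
        String best = null;
        int bestLiterals = -1;
        for (String pattern : patterns) {
            int literals = pattern.replace("*", "").replace("?", "").length();
            if (matches(pattern, line) && literals > bestLiterals) {
                best = pattern;
                bestLiterals = literals;
            }
        }
        return best;
    }
}
```

Given the patterns "*", "LINE*", "LINEA*", and "USER*", a line starting with "LINEA" selects "LINEA*", a line starting with "LINEB" falls back to "LINE*", and any other line selects the default "*", regardless of the order in which the patterns are listed.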
The following example shows how to match a line not matched by any other pattern in Java:
...
tokenizers.put("*", defaultLineTokenizer());
...
The following example shows how to match a line not matched by any other pattern in XML:
<entry key="*" value-ref="defaultLineTokenizer" />
There is also a PatternMatchingCompositeLineTokenizer that can be used for tokenization alone.
It is also common for a flat file to contain records that each span multiple lines. To handle this situation, a more complex strategy is required. A demonstration of this common pattern can be found in the multiLineRecords sample.
Exception Handling in Flat Files
There are many scenarios in which tokenizing a line may cause exceptions to be thrown. Many flat files are imperfect and contain incorrectly formatted records. Many users choose to skip these erroneous lines while logging the issue, the original line, and the line number. These logs can later be inspected manually or by another batch job. For this reason, Spring Batch provides a hierarchy of exceptions for handling parse exceptions: FlatFileParseException and FlatFileFormatException. FlatFileParseException is thrown by the FlatFileItemReader when any errors are encountered while trying to read a file. FlatFileFormatException is thrown by implementations of the LineTokenizer interface and indicates a more specific error encountered while tokenizing.
IncorrectTokenCountException
Both DelimitedLineTokenizer and FixedLengthTokenizer have the ability to specify column names that can be used for creating a FieldSet. However, if the number of column names does not match the number of columns found while tokenizing a line, the FieldSet cannot be created, and an IncorrectTokenCountException is thrown, which contains the number of tokens encountered and the number expected, as shown in the following example:
tokenizer.setNames(new String[] {"A", "B", "C", "D"});

try {
    tokenizer.tokenize("a,b,c");
}
catch (IncorrectTokenCountException e) {
    assertEquals(4, e.getExpectedCount());
    assertEquals(3, e.getActualCount());
}
Because the tokenizer was configured with 4 column names but only 3 tokens were found in the line, an IncorrectTokenCountException was thrown.
IncorrectLineLengthException
Files formatted in a fixed-length format have additional requirements when parsing because, unlike a delimited format, each column must strictly adhere to its predefined width. If the total line length does not equal the largest upper bound among the configured column ranges, an exception is thrown, as shown in the following example:
tokenizer.setColumns(new Range[] { new Range(1, 5),
                                   new Range(6, 10),
                                   new Range(11, 15) });
try {
    tokenizer.tokenize("12345");
    fail("Expected IncorrectLineLengthException");
}
catch (IncorrectLineLengthException ex) {
    assertEquals(15, ex.getExpectedLength());
    assertEquals(5, ex.getActualLength());
}
The configured ranges for the tokenizer above are: 1-5, 6-10, and 11-15. Consequently, the expected total length of the line is 15. However, in the preceding example, a line of length 5 was passed in, causing an IncorrectLineLengthException to be thrown. Throwing an exception here rather than only mapping the first column allows the processing of the line to fail earlier and with more information than it would contain if it failed while trying to read in column 2 in a FieldSetMapper. However, there are scenarios where the length of the line is not always constant. For this reason, validation of line length can be turned off via the 'strict' property, as shown in the following example:
tokenizer.setColumns(new Range[] { new Range(1, 5), new Range(6, 10) });
tokenizer.setStrict(false);
FieldSet tokens = tokenizer.tokenize("12345");
assertEquals("12345", tokens.readString(0));
assertEquals("", tokens.readString(1));
The preceding example is almost identical to the one before it, except that tokenizer.setStrict(false) was called. This setting tells the tokenizer not to enforce line lengths when tokenizing the line. A FieldSet is now correctly created and returned. However, it contains only empty tokens for the remaining values.
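The difference between strict and lenient handling can be sketched in plain Java as follows. This is an approximation under stated assumptions rather than the FixedLengthTokenizer implementation: the class LineLengthSketch is hypothetical, ranges are given as 1-based inclusive int pairs, and an IllegalArgumentException stands in for IncorrectLineLengthException.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of strict versus lenient line-length handling; not Spring Batch code. */
final class LineLengthSketch {

    /** Each range is {start, end}, 1-based and inclusive. */
    static List<String> tokenize(String line, int[][] ranges, boolean strict) {
        // The expected line length is the largest upper bound of any range.
        int expectedLength = 0;
        for (int[] range : ranges) {
            expectedLength = Math.max(expectedLength, range[1]);
        }
        if (strict && line.length() != expectedLength) {
            throw new IllegalArgumentException(
                    "Expected length " + expectedLength + " but line has length " + line.length());
        }
        // In lenient mode, ranges that reach past the end of the line yield shorter or empty tokens.
        List<String> tokens = new ArrayList<>();
        for (int[] range : ranges) {
            int start = Math.min(range[0] - 1, line.length());
            int end = Math.min(range[1], line.length());
            tokens.add(line.substring(start, end));
        }
        return tokens;
    }
}
```

In lenient mode the number of tokens still equals the number of ranges, so a mapper that reads fields by position or name keeps working and simply sees empty strings for the missing columns.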