DocumentDB Tutorial
DocumentDB - Partitioning
When your database starts to grow beyond 10GB, you can scale out simply by creating new collections and then spreading or partitioning your data across more and more collections.
Sooner or later a single collection, which has a 10GB capacity, will not be enough to contain your database. Now 10GB may not sound like a very large number, but remember that we're storing JSON documents, which are just plain text, and you can fit a lot of plain-text documents in 10GB, even when you account for the storage overhead of the indexes.
Storage isn't the only concern when it comes to scalability. The maximum throughput available on a collection is 2,500 request units per second, which is what you get with an S3 collection. Hence, if you need higher throughput, you will also need to scale out by partitioning across multiple collections. Scale-out partitioning is also called horizontal partitioning.
There are many approaches that can be used for partitioning data with Azure DocumentDB. The following are the most common strategies −
- Spillover Partitioning
- Range Partitioning
- Lookup Partitioning
- Hash Partitioning
Spillover Partitioning
Spillover partitioning is the simplest strategy because there is no partition key. It's often a good choice to start with when you're unsure about a lot of things. You might not know if you'll ever need to scale out beyond a single collection, how many collections you may need to add, or how fast you may need to add them.
- Spillover partitioning starts with a single collection and there is no partition key.
- The collection starts to grow, and then grows some more, and then some more, until you start getting close to the 10GB limit.
- When you reach 90 percent capacity, you spill over to a new collection and start using it for new documents, as sketched in the code after this list.
- Once your database scales out to a larger number of collections, you'll probably want to shift to a strategy that's based on a partition key.
- When you do that, you'll need to rebalance your data by moving documents to different collections based on whatever strategy you're migrating to.
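To make that spillover check concrete, here is a minimal sketch, assuming a helper that tracks the current collection in a couple of static fields (the names and the 90 percent threshold are hypothetical). The CollectionSizeQuota and CollectionSizeUsage figures (reported in kilobytes) come back on the response from ReadDocumentCollectionAsync.

private static string currentCollectionLink = "dbs/myfirstdb/colls/Collection1";
private static int collectionNumber = 1;

private static async Task<string> ResolveSpilloverCollection(DocumentClient client) {
   // Read the current collection to get its reported size quota and usage (in KB).
   var response = await client.ReadDocumentCollectionAsync(currentCollectionLink);
   double percentUsed = (double)response.CollectionSizeUsage / response.CollectionSizeQuota;

   if (percentUsed >= 0.9) {
      // At 90 percent capacity, spill over: create the next collection
      // and direct all new documents to it from now on.
      collectionNumber++;
      var next = await client.CreateDocumentCollectionAsync("dbs/myfirstdb",
         new DocumentCollection { Id = "Collection" + collectionNumber });
      currentCollectionLink = next.Resource.SelfLink;
   }
   return currentCollectionLink;
}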
Range Partitioning
One of the most common strategies is range partitioning. With this approach you determine the range of values that a document’s partition key might fall in and direct the document to a collection corresponding to that range.
- Dates are very typically used with this strategy, where you create a collection to hold documents that fall within a defined range of dates. You should define ranges that are small enough that you're confident no collection will ever exceed its 10GB limit. For example, there may be a scenario where a single collection can reasonably handle documents for an entire month.
- It may also be the case that most users are querying for current data, which would be data for this month or perhaps last month, but users are rarely searching for much older data. So you start off in June with an S3 collection, which is the most expensive collection you can buy and delivers the best throughput you can get.
- In July you buy another S3 collection to store the July data and you also scale the June data down to a less expensive S2 collection. Then in August, you get another S3 collection and scale July down to an S2 and June all the way down to an S1. It goes on like this, month after month, where you're always keeping the current data available for high throughput while older data is kept available at lower throughputs.
- As long as the query provides a partition key, only the collection that needs to be queried will get queried, and not all the collections in the database, as happens with spillover partitioning. A sketch of such a monthly range map follows this list.
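To make the monthly scheme concrete, here is a sketch of such a range map using the same RangePartitionResolver demonstrated later in this chapter. The collection names and the date property are assumptions; documents are presumed to carry a date property formatted as yyyy-MM-dd, so that lexical ranges line up with calendar months.

// Assumes each document has a "date" property formatted as "yyyy-MM-dd",
// so one lexical range per month routes documents to that month's collection.
var monthlyResolver = new RangePartitionResolver<string>(
   "date", new Dictionary<Range<string>, string>() {
      { new Range<string>("2016-06-01", "2016-06-30\uffff"), "dbs/myfirstdb/colls/June2016" },
      { new Range<string>("2016-07-01", "2016-07-31\uffff"), "dbs/myfirstdb/colls/July2016" },
      { new Range<string>("2016-08-01", "2016-08-31\uffff"), "dbs/myfirstdb/colls/August2016" },
   });
client.PartitionResolvers["dbs/myfirstdb"] = monthlyResolver;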
Lookup Partitioning
With lookup partitioning you can define a partition map that routes documents to specific collections based on their partition key. For example, you could partition by region.
- Store all US documents in one collection, all European documents in another collection, and all documents from any other region in a third collection.
- Using this partition map, a lookup partition resolver can figure out which collection to create a document in and which collection to query, based on the partition key, which is the region property contained in each document. A sketch of such a resolver follows this list.
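The .NET SDK doesn't ship a ready-made lookup resolver, but the IPartitionResolver interface (which the range and hash resolvers also implement) makes one straightforward to sketch. The class name, collection links, and region property below are illustrative assumptions.

public class RegionLookupPartitionResolver : IPartitionResolver {
   // The partition map: region value -> collection link (links are assumed).
   private readonly Dictionary<string, string> map = new Dictionary<string, string> {
      { "US", "dbs/myfirstdb/colls/CollectionUS" },
      { "Europe", "dbs/myfirstdb/colls/CollectionEU" }
   };
   private const string OtherRegions = "dbs/myfirstdb/colls/CollectionOther";

   // Extract the partition key (the region property) from the document.
   public object GetPartitionKey(object document) {
      return ((dynamic)document).region;
   }

   // Route creates to the collection mapped to the document's region.
   public string ResolveForCreate(object partitionKey) {
      string link;
      return map.TryGetValue((string)partitionKey, out link) ? link : OtherRegions;
   }

   // Route reads and queries the same way, so only one collection gets queried.
   public IEnumerable<string> ResolveForRead(object partitionKey) {
      yield return ResolveForCreate(partitionKey);
   }
}

Registering it works the same as for any other resolver: client.PartitionResolvers["dbs/myfirstdb"] = new RegionLookupPartitionResolver();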
Hash Partitioning
In hash partitioning, partitions are assigned based on the value of a hash function, allowing you to evenly distribute requests and data across a number of partitions.
This is commonly used to partition data produced or consumed by a large number of distinct clients, and is useful for storing user profiles, catalog items, and so on.
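As a sketch, the SDK's HashPartitionResolver (from the Microsoft.Azure.Documents.Partitioning namespace) can be handed the partition key property name and a set of collection links; the three collections here are assumptions.

// Documents are distributed across the listed collections by a hash of userId,
// which evens out both storage and request load.
var hashResolver = new HashPartitionResolver(
   "userId",
   new[] {
      "dbs/myfirstdb/colls/Collection1",
      "dbs/myfirstdb/colls/Collection2",
      "dbs/myfirstdb/colls/Collection3"
   });
client.PartitionResolvers["dbs/myfirstdb"] = hashResolver;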
Let’s take a look at a simple example of range partitioning using the RangePartitionResolver supplied by the .NET SDK.
Step 1 − Create a new DocumentClient, and in the CreateCollections task create two collections. One will contain documents for users that have user IDs beginning with A through M, and the other for user IDs N through Z.
private static async Task CreateCollections(DocumentClient client) {
   await client.CreateDocumentCollectionAsync("dbs/myfirstdb", new DocumentCollection {
      Id = "CollectionAM" });
   await client.CreateDocumentCollectionAsync("dbs/myfirstdb", new DocumentCollection {
      Id = "CollectionNZ" });
}
Step 2 − Register the range resolver for the database.
Step 3 − Create a new RangePartitionResolver<string>, where string is the datatype of our partition key. The constructor takes two parameters: the property name of the partition key, and a dictionary that is the shard map or partition map, which is simply the list of ranges and corresponding collections that we are predefining for the resolver.
private static void RegisterRangeResolver(DocumentClient client) {
   // Note: \uffff is the largest character value, so "M\uffff" includes all strings that start with M.
   var resolver = new RangePartitionResolver<string>(
      "userId", new Dictionary<Range<string>, string>() {
         { new Range<string>("A", "M\uffff"), "dbs/myfirstdb/colls/CollectionAM" },
         { new Range<string>("N", "Z\uffff"), "dbs/myfirstdb/colls/CollectionNZ" },
      });
   client.PartitionResolvers["dbs/myfirstdb"] = resolver;
}
It's necessary to append the largest possible character value here. Otherwise the first range wouldn't match any strings starting with M other than the single character M itself, and likewise for Z in the second range. So, you can think of this encoded value as a wildcard for matching on the partition key.
Step 4 − After creating the resolver, register it for the database with the current DocumentClient. To do that, just assign it to the client's PartitionResolvers dictionary, keyed by the database link.
We'll create and query documents against the database, not against a collection as you normally would; the resolver will use this map to route requests to the appropriate collections.
Now let’s create some documents. First we will create one for userId Kirk, and then one for Spock.
private static async Task CreateDocumentsAcrossPartitions(DocumentClient client) {
   Console.WriteLine();
   Console.WriteLine("**** Create Documents Across Partitions ****");
   var kirkDocument = await client.CreateDocumentAsync("dbs/myfirstdb",
      new { userId = "Kirk", title = "Captain" });
   Console.WriteLine("Document 1: {0}", kirkDocument.Resource.SelfLink);
   var spockDocument = await client.CreateDocumentAsync("dbs/myfirstdb",
      new { userId = "Spock", title = "Science Officer" });
   Console.WriteLine("Document 2: {0}", spockDocument.Resource.SelfLink);
}
The first parameter here is a self-link to the database, not a specific collection. Creating a document against a database link is not possible without a partition resolver, but with one registered it just works seamlessly.
Both documents were saved to the database myfirstdb. If our RangePartitionResolver is working properly, Kirk was stored in the collection for A through M and Spock in the collection for N through Z.
Let’s call these from the CreateDocumentClient task as shown in the following code.
private static async Task CreateDocumentClient() {
// Create a new instance of the DocumentClient
using (var client = new DocumentClient(new Uri(EndpointUrl), AuthorizationKey)) {
await CreateCollections(client);
RegisterRangeResolver(client);
await CreateDocumentsAcrossPartitions(client);
}
}
When the above code is executed, you will receive the following output.
**** Create Documents Across Partitions ****
Document 1: dbs/Ic8LAA==/colls/Ic8LAO2DxAA=/docs/Ic8LAO2DxAABAAAAAAAAAA==/
Document 2: dbs/Ic8LAA==/colls/Ic8LAP12QAE=/docs/Ic8LAP12QAEBAAAAAAAAAA==/
As you can see, the self-links of the two documents contain different collection resource IDs, because the two documents exist in two separate collections.
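With the resolver still registered, queries can also be issued against the database link. The following is a hedged sketch, assuming the CreateDocumentQuery overload that accepts a partitionKey argument; supplying the key lets the resolver route the query to CollectionAM alone instead of fanning out to every collection.

private static void QueryByPartitionKey(DocumentClient client) {
   // Supplying the partition key routes this query to CollectionAM only.
   var kirk = client
      .CreateDocumentQuery("dbs/myfirstdb",
         "SELECT * FROM c WHERE c.userId = 'Kirk'",
         partitionKey: "Kirk")
      .AsEnumerable()
      .First();
   Console.WriteLine("Found: {0}", kirk);
}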