Efficient Bulk Insertion into a Distributed Ordered Table
Source:
SIGMOD Conference, Vancouver, BC, Canada (2008)
Abstract:
We study the problem of bulk-inserting records into tables
in a system that horizontally range-partitions data over a
large cluster of shared-nothing machines. Each table partition
contains a contiguous portion of the table’s key range,
and must accept all records inserted into that range. Examples
of such systems include BigTable at Google, and
PNUTS at Yahoo! During bulk inserts into an existing
table, if most of the inserted records end up going into
a small number of data partitions, the obtained throughput
may be very poor due to ineffective use of cluster parallelism.
We propose a novel approach in which a planning phase is
invoked before the actual insertions. By creating new partitions
and intelligently distributing partitions across machines,
the planning phase ensures that the insertion load
will be well-balanced. Since there is a tradeoff between
the cost of moving partitions and the resulting throughput
gain, the planning phase must minimize the sum of partition
movement time and insertion time. We show that this
problem is a variation of NP-hard bin-packing, reduce it to
a problem of packing vectors, and then give a solution with
provable approximation guarantees. We evaluate our approach
on a prototype system deployed on a cluster of 50
machines, and show that it yields significant improvements
over more naive techniques.
Download: