Sunday, October 27, 2013

Column-oriented Database Systems and Query Execution

Column databases are evolving very fast these days, driven by the need to store and query "big data". I've made some notes here based on Daniel Abadi's doctoral dissertation, which won the SIGMOD Jim Gray Doctoral Dissertation Award. The notes also contain material from the VLDB '09 tutorial on columnar DBMSs.

In a column-store DB, queries read only the relevant attributes, but writing a tuple requires multiple seeks. So it is generally a good fit for read-intensive data warehouses, which are characterized by lots of column scans and batch writes.

The simplest storage solution is to use tuple IDs: each column becomes a separate table, with a table header and a tuple ID attached to each value. Query optimization is very expensive under this representation. We can instead drop the tuple IDs and store the data in its original order, with the table header kept in a separate column. For each query, we first join all accessed attributes back into traditional row tuples, then apply row-oriented optimization.
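To make the two representations concrete, here is a minimal sketch in Python; the table, the column names, and the values are made up for illustration:

# Hypothetical table: sales(customer, amount).

# Layout 1: explicit (tuple_id, value) pairs in every column.
customer_with_ids = [(1, "alice"), (2, "bob"), (3, "alice")]
amount_with_ids   = [(1, 30), (2, 12), (3, 7)]

# Layout 2: drop the tuple IDs and rely on the original insertion order;
# the i-th entry of every column belongs to the i-th tuple.
customer = ["alice", "bob", "alice"]
amount   = [30, 12, 7]

def materialize_rows(*columns):
    """Join positionally aligned columns back into traditional row tuples."""
    return list(zip(*columns))

print(materialize_rows(customer, amount))
# [('alice', 30), ('bob', 12), ('alice', 7)]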
 
Observing that I/O is the dominant cost, the opportunity for column stores is to exploit the high value locality within a column and apply compression. For example, a sorted column can be stored as (value, start index, length) triples to avoid repetition.
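As a rough sketch of that idea (run-length encoding), assuming a sorted column of made-up values:

def rle_encode(column):
    """Encode a (preferably sorted) column as (value, start_index, run_length) triples."""
    runs = []
    for i, v in enumerate(column):
        if runs and runs[-1][0] == v:
            value, start, length = runs[-1]
            runs[-1] = (value, start, length + 1)
        else:
            runs.append((v, i, 1))
    return runs

# A sorted, low-cardinality column collapses into a handful of runs.
quarter = ["Q1"] * 300 + ["Q2"] * 350 + ["Q3"] * 290
print(rle_encode(quarter))
# [('Q1', 0, 300), ('Q2', 300, 350), ('Q3', 650, 290)]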
We can apply dictionary encoding for values drawn from a limited domain. In general, the data within a column comes from a single domain and thus contains significantly less entropy than a row store, which gives better compression efficiency: row stores often achieve a compression ratio of about 1:3, while column stores can reach 1:10. We can also use the extra space to store multiple copies of the data in different sort orders to further speed up query execution; sorted columns compress better and are faster for range queries. Column data can also be stored in a B-tree, with the values compacted in the leaf nodes.
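To illustrate the limited-domain encoding mentioned above (dictionary encoding), here is a minimal sketch on a made-up low-cardinality column; the integer codes could then be bit-packed or compressed further:

def dict_encode(column):
    """Replace each distinct value with a small integer code."""
    dictionary = {}
    codes = []
    for v in column:
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, codes

country = ["US", "DE", "US", "US", "FR", "DE"]
dictionary, codes = dict_encode(country)
print(dictionary)   # {'US': 0, 'DE': 1, 'FR': 2}
print(codes)        # [0, 1, 0, 0, 2, 1]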
Column stores also offer more opportunities for block-oriented and vectorized processing. Compression trades CPU for I/O, and queries can often operate on compressed data directly, avoiding decompression altogether.
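For instance, a predicate and a count can be evaluated directly on (value, start, length) runs like the ones above, touching one run instead of one row at a time. This is only an illustrative sketch, not code from the dissertation:

def count_matching(runs, predicate):
    """Count qualifying rows on RLE runs: the predicate runs once per run, not once per row."""
    return sum(length for value, start, length in runs if predicate(value))

def positions_matching(runs, predicate):
    """Return qualifying row positions as (start, length) ranges, still compressed."""
    return [(start, length) for value, start, length in runs if predicate(value)]

runs = [("Q1", 0, 300), ("Q2", 300, 350), ("Q3", 650, 290)]
print(count_matching(runs, lambda v: v == "Q2"))      # 350
print(positions_matching(runs, lambda v: v != "Q1"))  # [(300, 350), (650, 290)]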
A tuple can be materialized either at the beginning or at the end of the query plan; the most efficient way is to operate on the columns for as long as possible (late materialization). Basically we can (see the sketch after this list):
- first compute the predicates on the selected columns,
- then intersect the resulting bit arrays or Bloom filters,
- then do the projection on the selected attributes,
- finally do the aggregation on the projection.
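Here is a rough end-to-end sketch of that late-materialization pipeline on made-up, positionally aligned columns:

# Made-up columns; the i-th entry of each list belongs to the i-th tuple.
region = ["east", "west", "east", "east", "west"]
year   = [2012,  2013,  2013,  2013,  2013]
amount = [10,    20,    30,    40,    50]

# 1. Evaluate each predicate on its own column, producing a bit array.
bits_region = [v == "east" for v in region]
bits_year   = [v == 2013   for v in year]

# 2. Combine the bit arrays (logical AND) to get the qualifying positions.
qualifying = [i for i, (a, b) in enumerate(zip(bits_region, bits_year)) if a and b]

# 3. Project only the column the aggregate needs, only at those positions.
projected = [amount[i] for i in qualifying]

# 4. Aggregate the projected values.
print(sum(projected))   # 70  (rows 2 and 3 qualify)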

Sunday, October 20, 2013

Dremel: PB analysis in 3 seconds

Google published its paper "Dremel: Interactive Analysis of Web-Scale Datasets" in VLDB 2010. The goal of this work is to perform interactive data analysis at scale, on shared clusters of commodity machines, with fast response times and fault tolerance. Dremel accesses data in situ (in place) and executes queries on such data in a fraction of the time required by MapReduce. It is not a replacement for MR, though; rather, it is used to analyze the outputs of MR pipelines. A typical use case is to first use MR to pre-process the input data containing signals, then run some exploratory interactive queries in Dremel, which return in seconds, to verify the signals and improve the algorithm, and finally turn the result into a mature long-term job.

Dremel builds on ideas from web search and parallel DBMSs: a query gets pushed down a serving tree and is rewritten at each level; the replies are then aggregated and assembled into the result. It also provides a SQL-like language to express ad-hoc queries and executes them natively, without translating them into MR jobs the way Pig or Hive do.

Dremel adopts a nested columnar storage model, claiming the properties of lossless representation, fast encoding, and efficient record assembly. It uses Protocol Buffers as the serialization format. The schema syntax is:
t = dom | <A1 : t[*|?], ..., An : t[*|?]>
where * denotes repeated fields and ? denotes optional fields. The column-striped representation is based on the notions of repetition levels and definition levels:
- Repetition level: at which repeated field in the field's path the value has repeated (0 means a new record).
- Definition level: how many fields in the path that could be undefined (optional or repeated) are actually present.
Here's an example from the paper:
 http://yankaycom-wordpress.stor.sinaapp.com/uploads/2012/12/13458646314107_f.jpg


The paper gives the formulas and computes some example entries. I'd like to give a complete explanation of the values in Name.Language.Code and Links.Backward here (a small sketch of reading the levels back follows the two lists):
- Name.Language.Code: 
> (en-us, 0, 2): it's the first Code in document r1, so r=0. Code is required, so for d we only look at its ancestors; Name and Language are both present, so d=2.
> (en, 2, 2): the repetition occurred at Language, the 2nd repeated field in the path, so r=2. d=2 as above.
> (null, 1, 1): the next Name has only a Url field, so Code is null. The repetition occurred at Name, the 1st repeated field in the path, so r=1. Only Name is present, so d=1.
> (en-gb, 1, 2): it belongs to a new Name, so the repetition is again at the Name level and r=1. Name and Language are both present, so d=2.
> (null, 0, 1): this is document r2; its only Name has no Code, so the value is null. r=0 denotes a new record. Only Name is present, so d=1.

- Links.Backward:
> (null, 0, 1): r=0 as it's the first value in doc r1, but r1 has no Backward values, so the value is null. d=1 since only Links is present.
> (10, 0, 2): r=0 as we're done with doc r1 and start doc r2. Links (optional) and Backward (repeated) are both present, so d=2.
> (30, 1, 2): r=1 as the repetition occurred at Backward, the only repeated field in the path. d=2 as above.
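To make the levels easier to read back, here is a small sketch (my own, not from the paper) that takes the Name.Language.Code column above as (value, r, d) triples and recovers which Code values belong to which record:

# Name.Language.Code from the paper's example, as (value, repetition, definition) triples.
CODE_COLUMN = [
    ("en-us", 0, 2),
    ("en",    2, 2),
    (None,    1, 1),   # a Name with no Language/Code
    ("en-gb", 1, 2),
    (None,    0, 1),   # record r2: a Name with no Language/Code
]

def split_into_records(column):
    """Group a column's values by record: r == 0 marks the start of a new record."""
    records = []
    for value, r, d in column:
        if r == 0:
            records.append([])
        records[-1].append((value, r, d))
    return records

for i, rec in enumerate(split_into_records(CODE_COLUMN), start=1):
    # d == 2 means Name and Language are both present, so the Code value is real;
    # a smaller d means the entry is null and only records which ancestors exist.
    codes = [v for v, r, d in rec if d == 2]
    print(f"record r{i}: codes = {codes}")
# record r1: codes = ['en-us', 'en', 'en-gb']
# record r2: codes = []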

In storage, each column is stored as a set of blocks containing the r and d values and the compressed field values. Many compression techniques can be applied here to reduce the storage cost, exploiting properties of the encoding.

To assemble records efficiently, the key idea is to use a finite state machine (FSM) whose transitions are labeled with repetition levels; the FSM is traversed once per record. Record assembly is needed for record-oriented data processing such as MR; Dremel's own query execution can bypass this step.
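As a very rough sketch of the driver loop only, here is a toy example with an invented two-field schema, message Doc { required int64 DocId; repeated string Name; }, where the transition table is written by hand rather than derived with the paper's construction algorithm:

from collections import defaultdict

# Each column is a list of (value, repetition_level) pairs; the data is made up.
COLUMNS = {
    "DocId": [(10, 0), (20, 0)],
    "Name":  [("a", 0), ("b", 1), ("c", 0)],
}

# After emitting a value of `field`, peek the next repetition level in that field's
# column and follow the labeled transition. Reaching the start state again means
# the current record is complete (a simplification of the paper's end state).
FSM = {"DocId": {0: "Name"}, "Name": {1: "Name", 0: "DocId"}}

def assemble_records(columns, fsm, start="DocId"):
    pos = defaultdict(int)                     # read cursor per column
    records = []
    while pos[start] < len(columns[start]):
        record = {"DocId": None, "Name": []}   # hard-coded for this toy schema
        field = start
        while True:
            value, _ = columns[field][pos[field]]
            pos[field] += 1
            if field == "DocId":
                record["DocId"] = value
            else:
                record["Name"].append(value)
            # Peek the next repetition level of the current field's column.
            next_r = columns[field][pos[field]][1] if pos[field] < len(columns[field]) else 0
            field = fsm[field][next_r]
            if field == start:                 # back to the start state: record finished
                break
        records.append(record)
    return records

print(assemble_records(COLUMNS, FSM))
# [{'DocId': 10, 'Name': ['a', 'b']}, {'DocId': 20, 'Name': ['c']}]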

Dremel adopts a SQL-like language. It takes as input one or more tables with their schemas, and outputs a nested table and its output schema. The selection operator prunes away branches of the tree, and the projection operator emits a value at the same level of nesting as the most-repeated input field.

In the query execution phase, Dremel adopts a multi-level serving tree. The root server receives incoming queries, reads table metadata, and routes the queries to the next level. The root determines all horizontal partitions (tablets) of the table, and intermediate servers perform parallel aggregation of partial results. The query dispatcher maintains a histogram of tablet processing times and can re-dispatch work based on system load. Each server has an internal execution tree corresponding to a physical query execution plan. Selection-projection-aggregate queries are specially optimized because they are so common: within-record and cross-record aggregation can be done directly on column data, bypassing record assembly entirely.
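A small sketch of the aggregation-tree idea on made-up tablets: each leaf evaluates the rewritten query on its own horizontal partition, and every level above simply combines the partial results (here a COUNT becomes a SUM of per-tablet counts):

# Made-up tablets: each leaf server scans one horizontal partition of the table.
tablets = [
    ["en-us", "en", "en-gb"],      # tablet on leaf server 1
    ["fr", "en"],                  # tablet on leaf server 2
    ["en", "de", "en-us", "en"],   # tablet on leaf server 3
]

def leaf_count(tablet, predicate):
    """Leaf level: evaluate the rewritten query (a partial COUNT) on one tablet."""
    return sum(1 for v in tablet if predicate(v))

def aggregate(partials):
    """Intermediate/root level: COUNT over the union is the SUM of partial counts."""
    return sum(partials)

partials = [leaf_count(t, lambda v: v.startswith("en")) for t in tablets]
print(aggregate(partials))   # 7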