Tuesday, November 23, 2010

Google to buy Groupon?

Word came out recently that Google is going to buy Groupon. Most analyses basically suggest that Google could integrate Groupon's deals into its search results, YouTube videos, etc. It's kind of a surprise to me, as I had thought Facebook would be the final buyer, especially since Facebook is putting special effort into location-based services. The fit also looked promising to me: Groupon is essentially a collective buying site, and if it were integrated with Facebook's social model, it would be easy to layer recommendation/reputation context on top. Maybe Facebook is building that on its own? Anyway, let's sit back and see how things go.


Monday, November 15, 2010

privacy, privacy, cost of privacy ...

In social networks, privacy is always one of the major concerns. Google just got another lesson for its negligence(?): Google settles Google Buzz privacy suit for $8.5 million donation


Sunday, November 07, 2010

Era of Big Data

Since Microsoft's acquisition of DATAllegro in 2008, there have been more big moves this year: EMC bought Greenplum, IBM bought Netezza, and Oracle upgraded Exadata (interestingly, announced by Mark Hurd, who had joined Oracle as its new president no more than half a month earlier). All of these highlight the coming era of big data and the leading companies' strong desire to expand their large-scale data management and business analytics.
Here are several interesting links:
Big Data Means Big Sales
EMC launches Greenplum appliance

Big data comes not only from the quickly expanding web activity we encounter every day on Facebook, Twitter, Amazon, etc. Traditional industries are also generating huge amounts of data every minute with the help of the latest technologies. Sensor and RFID technology has been widely used by giants such as Walmart and Target to help collect data and enable more intelligent supply chain management. I also came across the following vision:
Sensor Networks Top Social Networks for Big Data

As envisioned there, we will enter the exabyte and zettabyte age within the next couple of years. The imminent era of big data calls for more aggressive advances in big data management and sharing, more intelligent and effective business analytics, and the security and privacy primitives associated with all of them.


Saturday, November 06, 2010

Access Control

After a user is authenticated and logged on to a system, their access to resources on the computer or network is controlled by access control modules.

Discretionary Access Control (DAC)
In a DAC model, a subject has complete control over the objects it owns and the programs it executes. The owner associates each of its objects with an access control list (ACL) containing the users and their level of access to that object. DAC is based on the owner's granting and revoking of privileges. Access to a resource is denied by default unless explicitly authorized. Most of today's operating systems use the DAC model.
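
To make the ACL mechanics concrete, here is a minimal sketch in Python; the class and method names are my own illustration, not any particular OS's API:

# Minimal DAC sketch: per-object ACLs, owner-controlled, deny by default.
class DacObject:
    def __init__(self, owner):
        self.owner = owner
        self.acl = {}  # user -> set of permissions, e.g. {"bob": {"read"}}

    def grant(self, requester, user, permission):
        # Only the owner may grant (or revoke) access, at their discretion.
        if requester != self.owner:
            raise PermissionError("only the owner may change the ACL")
        self.acl.setdefault(user, set()).add(permission)

    def check(self, user, permission):
        # Deny by default unless explicitly authorized.
        return permission in self.acl.get(user, set())

doc = DacObject(owner="alice")
doc.grant("alice", "bob", "read")
print(doc.check("bob", "read"))   # True
print(doc.check("bob", "write"))  # False: never granted, so denied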

The key weakness of DAC is that it suffers from Trojan horse attacks: any program running with a user's privileges can silently copy or leak objects that user can access.

Mandatory Access Control (MAC)
MAC is the strictest of all levels of control. The MAC model is targeted at systems in which confidentiality has the highest priority, such as military or government agencies. In a MAC-enforced system, both subjects and objects are assigned clearance levels (security labels). The administrator takes control of security label definition and assignment. Access to objects is constrained by policies on the security clearances, which are also defined by the administrator. The general access rule is no read up, no write down, following the Bell-LaPadula model, but it's also possible to extend it and define dedicated rules depending on the practical security requirements. MAC is fine-grained and can provide row- or column-level access control.
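
The Bell-LaPadula rules fit in a few lines. Here is an illustrative Python sketch using numeric levels (echoing how hierarchy labels are coded as numbers in Oracle Label Security, mentioned below); the level names are made up for the example:

# Bell-LaPadula sketch: higher number = higher clearance.
LEVELS = {"unclassified": 0, "confidential": 1, "secret": 2, "top_secret": 3}

def can_read(subject_level, object_level):
    # "No read up": a subject may only read objects at or below its level.
    return LEVELS[subject_level] >= LEVELS[object_level]

def can_write(subject_level, object_level):
    # "No write down": a subject may only write objects at or above its level.
    return LEVELS[subject_level] <= LEVELS[object_level]

print(can_read("secret", "confidential"))   # True
print(can_read("confidential", "secret"))   # False: read up denied
print(can_write("secret", "confidential"))  # False: write down denied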

Often seen as the most secure access control environment, MAC also requires extra pre-planning effort to be implemented effectively and securely. It also carries continuous system management overhead to handle new users, new objects, and changes to security label definitions.

Oracle 9i implemented Label Security to meet the MAC requirements and provide row-level access control; hierarchy labels are coded as numeric values. DB2 provides LBAC (label-based access control) for MAC at both the row and column level. A security label is composed of one or more security label components of three types: arrays (hierarchies), sets, and trees.

Role-Based Access Control (RBAC)
In an RBAC system, users also don't have discretionary access to objects. Instead, the administrator creates roles, each a collection of permissions for a job function or responsibility. Each user is assigned one or more roles and is delegated all the privileges associated with those roles. RBAC greatly simplifies the management of individual user rights and authorizations.
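
A minimal RBAC check, again as an illustrative Python sketch (the role and user names are made up):

# RBAC sketch: permissions attach to roles; users only receive roles.
ROLES = {
    "analyst":   {"select"},
    "etl_admin": {"select", "insert", "update"},
}
USER_ROLES = {"carol": {"analyst", "etl_admin"}, "dave": {"analyst"}}

def is_authorized(user, permission):
    # A user holds the union of the permissions of all assigned roles.
    return any(permission in ROLES[role]
               for role in USER_ROLES.get(user, set()))

print(is_authorized("carol", "insert"))  # True, via the etl_admin role
print(is_authorized("dave", "insert"))   # False: analyst cannot insert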

Many database systems have some implementation of RBAC, including Teradata, Oracle, DB2, and SQL Server.


Thursday, November 04, 2010

Shared Nothing Architecture

A shared nothing system greatly reduces resource contention for memory, locks, and processors. As pointed out by DeWitt et al., among the three widely used approaches, shared memory is the least scalable, shared disk is in between, and shared nothing is the most scalable. A shared nothing system can scale almost linearly, simply by adding more inexpensive nodes. Shared nothing is now prevalent in the data warehousing space due to this potential for scaling.

In Teradata, one of the earliest implementations, each AMP virtual processor (vproc) manages its own dedicated portion of the system's disk space (its vdisk, which can span multiple disk array ranks). Rows are distributed to the AMPs according to the hash of the primary index (PI). For NoPI tables, supported from TD 13.0, it either hashes on the Query ID for a row or uses a different algorithm to assign the row to its home AMP. This unconditional parallelism and linear expandability underpin its leading position in enterprise data warehousing.

Nowadays the shared nothing architecture is adopted by most high-performance scalable DBMSs, including Teradata, Netezza, Greenplum, DB2, and Vertica. It is also used by most of the high-end e-commerce platforms, including Amazon, Yahoo, Google, and Facebook.

In DB2 UDB Enterprise-Extended Edition (EEE), a partition key is chosen as one or more columns, and the hash of the partition key determines which node/node group a row is sent to.
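
The row-distribution idea shared by Teradata and DB2 EEE can be sketched in a few lines of Python. The hash function below is a stand-in; each product uses its own hashing scheme and bucket map:

import hashlib

NUM_NODES = 8  # AMPs / partitions in this toy cluster

def home_node(partition_key):
    # Hash the partition key and map it onto a node. A stable hash keeps
    # the same key on the same node across runs (unlike Python's hash()).
    digest = hashlib.md5(str(partition_key).encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

row = {"customer_id": 421337}
print(home_node(row["customer_id"]))  # the node that owns this row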

Oracle, by contrast, takes a shared-disk approach; in Oracle, shared nothing exists only at the logical level. Once the degree of parallelism is chosen as a power of 2, the number of partitions is decided and the partitions are generated by range-hash composite partitioning.

Cons: a shared nothing architecture takes longer to respond to queries that involve joins over large data sets from different partitions. For example, Teradata is not efficient for OLTP: CPU cycles are spread across several AMPs and PEs, and the PEs can easily become congested by massive OLTP request volumes.


Tuesday, November 02, 2010

Always On - Aster Data example

On June 29, 2010, Google's AdWords stopped serving ads sometime around 1:40pm PST, and the outage lasted about 3 hours. The estimated cost is about $7.8 million. For Amazon or eBay, even if some shoppers come back later, they still lose impulse buyers, which can amount to millions of dollars per hour.

Zero-downtime practices have been widely deployed for data migration, but for databases/data warehouses this is still a challenging problem. In general, system downtime can be classified as planned or unplanned. As 24x7 availability becomes more and more critical for data warehouse systems, the system is expected to stay on through both planned and unplanned downtime events.

Aster Data claims to have built its solution on Recovery-Oriented Computing to achieve this goal. The basic functionality includes:

- In-cluster replication and transparent fail-over
Data replicas are placed across the cluster, and server failures are transparently handled by replicas within the cluster (see the sketch after this list).

- Self-diagnostics
On a permanent failure, new replicas are created on existing or new servers without downtime; on a transient failure, the replicas resync after the server recovers.

- Network aggregation
Multiple pieces of network hardware provide parallelism and redundancy.

- Separation of duty
Dedicated servers handle loading/exporting data and backup/restore.

- Workload prediction
Policy-driven tools manage the priority of workloads and dynamically assign resources.
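
The replication-and-fail-over item above can be sketched in a few lines of Python. This is my own toy illustration, not Aster Data's actual implementation; the partition and node names are made up:

# Transparent read fail-over across replicas (illustrative only).
REPLICAS = {"partition_7": ["node2", "node5", "node8"]}  # primary listed first
DOWN = {"node2"}  # nodes the self-diagnostics layer has marked as failed

def read(partition):
    # Try the primary first, then fall over to the next live replica,
    # so the failure stays invisible to the client.
    for node in REPLICAS[partition]:
        if node not in DOWN:
            return "served from " + node
    raise RuntimeError("all replicas are down")

print(read("partition_7"))  # served from node5; node2's failure is masked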
