Wednesday 3 June 2009

Data Partitioning using GigaSpaces

Data partitioning is the key to reaching linear scalability as data volumes grow. Here is my take on GigaSpaces partitioning:

Each Data Grid instance (partition) holds a different subset of the objects in the Data Grid. When objects are written to the Data Grid, they are routed to the proper partition according to a predefined attribute of the object that acts as the routing index.

When a component (e.g. a data feeder or processor) writes, reads, or takes an object from the Data Grid, it uses GigaSpaces’ built-in “content-based routing”: the operation is directed to the relevant partition based on the routing index property of the object.
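
As a minimal sketch of how the routing index is declared in code (class and property names here are my own, assuming the standard OpenSpaces POJO annotations and the GigaSpace proxy), it might look like this:

```java
import com.gigaspaces.annotation.pojo.SpaceClass;
import com.gigaspaces.annotation.pojo.SpaceId;
import com.gigaspaces.annotation.pojo.SpaceRouting;
import org.openspaces.core.GigaSpace;

@SpaceClass
public class Trade {

    private String id;
    private Integer accountId;
    private Double amount;

    public Trade() {
    }

    @SpaceId(autoGenerate = true)
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    // The routing index: every operation on a Trade with the same
    // accountId is directed to the same partition.
    @SpaceRouting
    public Integer getAccountId() { return accountId; }
    public void setAccountId(Integer accountId) { this.accountId = accountId; }

    public Double getAmount() { return amount; }
    public void setAmount(Double amount) { this.amount = amount; }

    // Example of the basic operations against a clustered space proxy.
    public static void exampleOperations(GigaSpace gigaSpace) {
        Trade trade = new Trade();
        trade.setAccountId(42);      // routing value decides the target partition
        trade.setAmount(1000.0);
        gigaSpace.write(trade);      // written to exactly one partition

        Trade template = new Trade();
        template.setAccountId(42);   // same routing value -> single-partition operation
        Trade read = gigaSpace.read(template);
        Trade taken = gigaSpace.take(template);
    }
}
```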

Having said that, one might conclude that the number of partitions can’t change, and that is true. BUT GigaSpaces can still achieve the effect of “dynamic re-partitioning” through its ability to scale out. Looking into dynamic re-partitioning, the actual requirement behind it is the ability to cope with a growing amount of data, which naturally consumes a growing amount of memory.
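
Conceptually (this is an illustration, not the actual GigaSpaces implementation), the routing boils down to hashing the routing value modulo the number of partitions, which is exactly why that number has to be fixed up front:

```java
// Conceptual sketch: every client can compute the target partition locally,
// with no central lookup, as long as the partition count never changes.
public final class RoutingSketch {

    // Number of partitions, fixed at deployment time.
    private final int numberOfPartitions;

    public RoutingSketch(int numberOfPartitions) {
        this.numberOfPartitions = numberOfPartitions;
    }

    // Map a routing value (e.g. accountId) to a partition id.
    public int partitionFor(Object routingValue) {
        int hash = routingValue.hashCode();
        return (hash & Integer.MAX_VALUE) % numberOfPartitions;
    }

    public static void main(String[] args) {
        RoutingSketch routing = new RoutingSketch(4);
        // accountId 42 always maps to the same partition...
        System.out.println(routing.partitionFor(42));
        // ...which is why changing the partition count after deployment
        // would invalidate the placement of all existing objects.
    }
}
```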

Using GigaSpaces you can host any (predefined) number of partitions on a single server, and when a memory shortage occurs the existing partitions can start a “scale-out procedure” that utilizes standby or newly provisioned hardware available on the cloud. In other words: under peak data loads GigaSpaces can relocate partitions to available computational resources using its virtualized runtime, the Grid Service Container (GSC).
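
As a hedged sketch of what such a relocation could look like with the GigaSpaces Admin API (the lookup group and processing unit names below are placeholders, and this is only one way to pick a target container), one might move a partition instance onto an empty GSC like this:

```java
import org.openspaces.admin.Admin;
import org.openspaces.admin.AdminFactory;
import org.openspaces.admin.gsc.GridServiceContainer;
import org.openspaces.admin.pu.ProcessingUnit;
import org.openspaces.admin.pu.ProcessingUnitInstance;

// Sketch: move one partition (processing unit instance) onto a freshly
// started, empty GSC to relieve memory pressure on its current host.
public class RelocatePartition {

    public static void main(String[] args) {
        Admin admin = new AdminFactory().addGroup("myGroup").createAdmin();

        // "myDataGrid" stands in for the name of the deployed Data Grid PU.
        ProcessingUnit pu = admin.getProcessingUnits().waitFor("myDataGrid");
        pu.waitFor(pu.getTotalNumberOfInstances());

        // Pick a standby/new container: here, one hosting no instances yet.
        GridServiceContainer target = null;
        for (GridServiceContainer gsc : admin.getGridServiceContainers()) {
            if (gsc.getProcessingUnitInstances().length == 0) {
                target = gsc;
                break;
            }
        }

        if (target != null) {
            // Relocate the first partition instance to the free container.
            ProcessingUnitInstance instance = pu.getInstances()[0];
            instance.relocate(target);
        }

        admin.close();
    }
}
```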

From an architectural perspective, having a static, predefined number of partitions leads to better data affinity, minimal latency in terms of network hops, and a natural load-sharing model that is more resilient. Compared with true dynamic re-partitioning, one must consider the following: first, adding a cache instance does not immediately balance the load, since the new instance fills up slowly while the existing partitions continue to share the load and can still run out of memory; second, in many cases there will be additional network hops to locate the distributed data (and will that still be transactional?).


Mickey