Wednesday, 3 June 2009
Data Partitioning using GigaSpaces
Each Data Grid instance (partition) holds a different subset of the objects in the Data Grid. When the objects are written to this Data Grid, they are routed to the proper partition, according to a predefined attribute in the object that acts as the routing index.
When a component (e.g. a data feeder or processor) writes, reads, or takes an object from the data grid, it uses the built-in "content-based routing" functionality of GigaSpaces: the operation is directed to the relevant partition based on the routing-index property of the object.
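A minimal sketch of the routing idea in plain Java (this is not the actual GigaSpaces implementation; the hash-mod scheme and the names here are assumptions, but it is the common way a fixed routing index maps onto a fixed partition count):

```java
// Conceptual sketch of content-based routing: a routing-index value is
// deterministically mapped to one of a fixed number of partitions.
public class RoutingSketch {

    // Hypothetical hash-mod routing function; floorMod keeps the result
    // non-negative even for negative hash codes.
    static int partitionFor(Object routingValue, int partitionCount) {
        return Math.floorMod(routingValue.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        int partitions = 4;
        // The same routing value always lands on the same partition.
        System.out.println(partitionFor(1234, partitions));
        System.out.println(partitionFor(1234, partitions));
    }
}
```

Because the mapping is a pure function of the routing value, every write, read, or take on the same value lands on the same partition, which is also why the partition count itself must stay fixed.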
Having said that, one might conclude that the number of partitions can't change, and that is true. BUT GigaSpaces can still achieve the effect of "dynamic re-partitioning" by using its ability to scale out. Looking into dynamic repartitioning, the actual requirement behind it is the ability to cope with a growing amount of data that naturally consumes a growing amount of memory.
Using GigaSpaces you can host any (predefined) number of partitions on a single server, and when a memory shortage occurs the existing partitions can start the "scale-out procedure", utilizing standby or new hardware available on the cloud. In other words: under data peak loads GigaSpaces is able to relocate partitions to available computational resources using the GigaSpaces virtualized runtime (GSC).
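The key point is that relocation changes where a partition runs, not how many partitions exist. A hypothetical simulation in plain Java (this is not the GigaSpaces Admin API; the container names and methods are invented for illustration) makes the distinction concrete:

```java
import java.util.*;

// Conceptual sketch: a fixed set of partitions is spread over whatever
// containers (GSCs) are currently up, and a partition can be relocated to a
// newly provisioned container under memory pressure.
public class RelocationSketch {

    // Initial placement: partition id -> container name, round-robin.
    static Map<Integer, String> placePartitions(int partitionCount, List<String> containers) {
        Map<Integer, String> placement = new HashMap<>();
        for (int p = 0; p < partitionCount; p++) {
            placement.put(p, containers.get(p % containers.size()));
        }
        return placement;
    }

    // Relocate one partition to a fresh container; the partition count
    // (and therefore the routing) never changes.
    static void relocate(Map<Integer, String> placement, int partition, String newContainer) {
        placement.put(partition, newContainer);
    }

    public static void main(String[] args) {
        Map<Integer, String> placement =
                placePartitions(4, Arrays.asList("gsc-1", "gsc-2"));
        relocate(placement, 3, "gsc-3"); // scale out: move partition 3 to new hardware
        System.out.println(placement);
    }
}
```

Since the routing index still maps to the same partition ids, clients keep working unchanged while the partition gains a whole machine's worth of memory.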
From an architectural perspective, having a static, predefined number of partitions leads to better data affinity, minimal latency in terms of network hops, and a natural load-sharing model that is more resilient. Compared with true dynamic repartitioning, one must consider the following: first, an additional cache instance does not lead to balanced load, as the new instance will only slowly fill up to the level of the existing partitions, which continue to share the load and can still run out of memory; second, in many cases there will be additional network hops to find the distributed data (and will that lookup be transactional...).
Mickey
Monday, 25 May 2009
Still stuck with JEE application server while using Spring?
The GigaSpaces application server offers a great alternative to commercial application servers, which are becoming overkill as the resources of a full-fledged application server are no longer needed. It is a common paradox nowadays to see JEE app servers that are actually just running Spring…
The migration effort requires some analysis. In many cases, such as a standard web application, the migration is seamless; other cases, such as use of specific app-server resources like transaction management or connection pooling, should be inspected and may require some configuration changes.
What are the benefits and what still needs to be resolved?
By simply deploying your web application into GigaSpaces you'll benefit from an advanced, elastic runtime known as the GigaSpaces service grid.
High availability has many levels. In JEE it is normally based on database HA and a simplistic app-server restart invoked by a watchdog service. Another aspect, specific to web applications, is session replication, which is based on in-memory replication of the HTTP session objects across the entire application server cluster. GigaSpaces extends the resiliency of session replication by using in-memory clustering of the application data; this part normally involves some coding, depending on the complexity of the data model and transactions.
Designing a scalable architecture
A scalable application is measured by its linearity. Trying to solve the problem at the database level, or with JVM replication layers, will not lead to linearity, as bottlenecks will just continue to pop up wherever there is a stateful service or data. There are two fundamental requirements for scalability: first, scale the entire application stack (not just the data); second, implement the share-nothing approach (AKA a Processing Unit) per business unit.
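A hypothetical sketch of the share-nothing Processing Unit idea in plain Java (the class and method names are invented for illustration, not the GigaSpaces API): each unit owns its data and its work queue for one slice of the domain, so adding units adds capacity without creating a shared bottleneck.

```java
import java.util.*;

// Conceptual sketch: each Processing Unit holds its own state and messaging,
// and messages are routed to the owning unit, so units never contend.
public class ProcessingUnitSketch {

    static class ProcessingUnit {
        private final Map<String, Integer> data = new HashMap<>(); // unit-private state
        private final Queue<String> inbox = new ArrayDeque<>();    // unit-private messaging

        void send(String orderId) {
            inbox.add(orderId);
        }

        // Process everything in the inbox against local data only.
        int drain() {
            int processed = 0;
            String id;
            while ((id = inbox.poll()) != null) {
                data.merge(id, 1, Integer::sum);
                processed++;
            }
            return processed;
        }
    }

    public static void main(String[] args) {
        ProcessingUnit[] units = { new ProcessingUnit(), new ProcessingUnit() };
        // Route each message to the unit that owns its key
        // (the same idea as the data-routing index).
        for (String order : new String[] { "a", "b", "c", "d" }) {
            units[Math.floorMod(order.hashCode(), units.length)].send(order);
        }
        int total = units[0].drain() + units[1].drain();
        System.out.println(total); // all 4 messages processed, no shared state touched
    }
}
```

Because no state crosses unit boundaries, doubling the number of units roughly doubles capacity, which is exactly the linearity the text describes.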
Critical path approach
I often get questions and remarks about in-memory data clustering from a size and safety perspective. The remark usually goes like: "we have a terabyte of data, we can't cache it all, and if we could it would cost too much".
Ideally a processing unit would hold its own service, data, and messaging. One must realize that this is not always the case, especially in complex business applications or with massive data (where it might not be cost effective). BUT there is a way out: the rule of thumb is to identify the critical path, both in terms of throughput and latency, and then use the tools available (such as the Space API and collocation) to crush the latency on that path and cache (and partition) its data.
The soft points are the shared resources. No matter how optimized the replication mechanism is ("cool concurrent field-level", object-level, or in-memory database binary-log replication), you can actually put Amdahl's law to the test and watch the throughput per node diminish until you reach a dead end. If your solution is based on replication or master/slave, you'll probably feel it by the 2nd and 3rd nodes...
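Amdahl's law makes this concrete. If a fraction p of the work parallelizes and the remaining (1 - p) is serialized on the shared or replicated resource, the speedup on n nodes is 1 / ((1 - p) + p / n). A short sketch (the 90% figure is an assumption for illustration):

```java
// Amdahl's law: speedup on n nodes when only a fraction of the work
// parallelizes, the rest being serialized on a shared resource.
public class AmdahlSketch {

    static double speedup(double parallelFraction, int nodes) {
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / nodes);
    }

    public static void main(String[] args) {
        double p = 0.9; // assumed: 10% of each operation hits the shared/replicated resource
        for (int n = 1; n <= 4; n++) {
            System.out.printf("%d nodes -> %.2fx%n", n, speedup(p, n));
        }
    }
}
```

Even with 90% of the work parallel, 4 nodes yield only about a 3.1x speedup, and the per-node gain keeps shrinking from there, which is the "feel it on the 2nd and 3rd nodes" effect.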
A short say about cost
Costs are measured by many aspects, such as learning curve, time to market, hidden costs, TCO, ROI, ROA, etc.; there is a lot of public information about them. What I want to emphasize is future-proofing, existing skills, and open source.
Application servers, as runtime platforms, can no longer rely on hardware or database clustering; it is their job to take care of HA and scaling and to meet the requirements leveraging existing hardware. Real-time HA should not be an extra cost. IT is moving towards elasticity (the cloud) based on pure cost savings, and that environment requires different runtime capabilities in order to leverage the cloud; the main factor is being agnostic to differently (on-the-fly) provisioned hardware (IP, disk, etc.).
A last word on open source, which I use on a daily basis. Open source has a dramatic impact on future standards through the bottom-up approach, so when I choose an app server I want to verify that it supports open-source standards such as Spring. BUT as this market grows and commercial entities start to take part in it, you'll soon find out they are using the "drug dealer approach": giving you something for free (sometimes a very essential part of your system!), and then, when it comes to future requirements or moving to production, you discover that HA and other features are not part of the community package. Trying to find out the costs, you might discover that there isn't any list price available….