Wednesday, 3 June 2009
Data Partitioning using GigaSpaces
Each Data Grid instance (partition) holds a different subset of the objects in the Data Grid. When the objects are written to this Data Grid, they are routed to the proper partition, according to a predefined attribute in the object that acts as the routing index.
When a component (e.g. a data feeder or processor) writes, reads, or takes an object from the data grid, it uses the built-in "content-based routing" functionality of GigaSpaces: the operation is directed to the relevant partition based on the routing-index property of the object.
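A minimal sketch of the routing idea in plain Java (this is not the actual GigaSpaces implementation; the hash-mod scheme and the names here are assumptions, but it is the common way a fixed routing index maps onto a fixed partition count):

```java
// Conceptual sketch of content-based routing: a routing-index value is
// deterministically mapped to one of a fixed number of partitions.
public class RoutingSketch {

    // Hypothetical hash-mod routing function; floorMod keeps the result
    // non-negative even for negative hash codes.
    static int partitionFor(Object routingValue, int partitionCount) {
        return Math.floorMod(routingValue.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        int partitions = 4;
        // The same routing value always lands on the same partition.
        System.out.println(partitionFor(1234, partitions));
        System.out.println(partitionFor(1234, partitions));
    }
}
```

Because the mapping is a pure function of the routing value, every write, read, or take on the same value lands on the same partition, which is also why the partition count itself must stay fixed.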
Having said that, one might conclude that the number of partitions can't change, and that is true. BUT GigaSpaces can still achieve the effect of "dynamic re-partitioning" by using its ability to scale out. Looking into dynamic repartitioning, the actual requirement behind it is the ability to cope with a growing amount of data that naturally consumes a growing amount of memory.
Using GigaSpaces you can host any (predefined) number of partitions on a single server, and when a memory shortage occurs the existing partitions can start the "scale-out procedure", utilizing standby or new hardware available on the cloud. In other words: under data peak loads GigaSpaces is able to relocate partitions to available computational resources using the GigaSpaces virtualized runtime (GSC).
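The key point is that relocation changes where a partition runs, not how many partitions exist. A hypothetical simulation in plain Java (this is not the GigaSpaces Admin API; the container names and methods are invented for illustration) makes the distinction concrete:

```java
import java.util.*;

// Conceptual sketch: a fixed set of partitions is spread over whatever
// containers (GSCs) are currently up, and a partition can be relocated to a
// newly provisioned container under memory pressure.
public class RelocationSketch {

    // Initial placement: partition id -> container name, round-robin.
    static Map<Integer, String> placePartitions(int partitionCount, List<String> containers) {
        Map<Integer, String> placement = new HashMap<>();
        for (int p = 0; p < partitionCount; p++) {
            placement.put(p, containers.get(p % containers.size()));
        }
        return placement;
    }

    // Relocate one partition to a fresh container; the partition count
    // (and therefore the routing) never changes.
    static void relocate(Map<Integer, String> placement, int partition, String newContainer) {
        placement.put(partition, newContainer);
    }

    public static void main(String[] args) {
        Map<Integer, String> placement =
                placePartitions(4, Arrays.asList("gsc-1", "gsc-2"));
        relocate(placement, 3, "gsc-3"); // scale out: move partition 3 to new hardware
        System.out.println(placement);
    }
}
```

Since the routing index still maps to the same partition ids, clients keep working unchanged while the partition gains a whole machine's worth of memory.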
From an architectural perspective, having a static, predefined number of partitions leads to better data affinity, minimal latency in terms of network hops, and a natural load-sharing model that is more resilient. Compared with true dynamic repartitioning, one must consider the following: first, an additional cache instance does not lead to balanced load, as the new instance will only slowly fill up to the level of the existing partitions, which continue to share the load and can still run out of memory; second, in many cases there will be additional network hops to find the distributed data (and will that lookup be transactional...).
Mickey
Monday, 25 May 2009
Still stuck with JEE application server while using Spring?
The GigaSpaces application server offers a great alternative to commercial application servers, which are becoming overkill as the resources of a full-fledged application server are no longer needed. It is a common paradox nowadays to see JEE app servers that are actually just running Spring…
The migration effort requires some analysis. In many cases, such as a standard web application, the migration is seamless; other cases, such as use of specific app-server resources like transaction management or connection pooling, should be inspected and may require some configuration changes.
What are the benefits and what still needs to be resolved?
By simply deploying your web application into GigaSpaces you'll benefit from an advanced, elastic runtime known as the GigaSpaces service grid.
High availability has many levels. In JEE it is normally based on database HA and a simplistic app-server restart invoked by a watchdog service. Another aspect, specific to web applications, is session replication, which is based on in-memory replication of the HTTP session objects across the entire application server cluster. GigaSpaces extends the resiliency of session replication by using in-memory clustering of the application data; this part normally involves some coding, depending on the complexity of the data model and transactions.
Designing a scalable architecture
A scalable application is measured by its linearity. Trying to solve the problem at the database level, or with JVM replication layers, will not lead to linearity, as bottlenecks will just continue to pop up wherever there is a stateful service or data. There are two fundamental requirements for scalability: first, scale the entire application stack (not just the data); second, implement the share-nothing approach (AKA a Processing Unit) per business unit.
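A hypothetical sketch of the share-nothing Processing Unit idea in plain Java (the class and method names are invented for illustration, not the GigaSpaces API): each unit owns its data and its work queue for one slice of the domain, so adding units adds capacity without creating a shared bottleneck.

```java
import java.util.*;

// Conceptual sketch: each Processing Unit holds its own state and messaging,
// and messages are routed to the owning unit, so units never contend.
public class ProcessingUnitSketch {

    static class ProcessingUnit {
        private final Map<String, Integer> data = new HashMap<>(); // unit-private state
        private final Queue<String> inbox = new ArrayDeque<>();    // unit-private messaging

        void send(String orderId) {
            inbox.add(orderId);
        }

        // Process everything in the inbox against local data only.
        int drain() {
            int processed = 0;
            String id;
            while ((id = inbox.poll()) != null) {
                data.merge(id, 1, Integer::sum);
                processed++;
            }
            return processed;
        }
    }

    public static void main(String[] args) {
        ProcessingUnit[] units = { new ProcessingUnit(), new ProcessingUnit() };
        // Route each message to the unit that owns its key
        // (the same idea as the data-routing index).
        for (String order : new String[] { "a", "b", "c", "d" }) {
            units[Math.floorMod(order.hashCode(), units.length)].send(order);
        }
        int total = units[0].drain() + units[1].drain();
        System.out.println(total); // all 4 messages processed, no shared state touched
    }
}
```

Because no state crosses unit boundaries, doubling the number of units roughly doubles capacity, which is exactly the linearity the text describes.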
Critical path approach
I often get questions and remarks about in-memory data clustering from a size and safety perspective. The remark usually goes like: "we have a terabyte of data, we can't cache it all, and if we could it would cost too much".
Ideally a processing unit would hold its own service, data, and messaging. One must realize that this is not always the case, especially in complex business applications or with massive data (where it might not be cost effective). BUT there is a way out: the rule of thumb is to identify the critical path, both in terms of throughput and latency, and then use the tools available (such as the Space API and collocation) to crush the latency on that path and cache (and partition) its data.
The soft points are the shared resources. No matter how optimized the replication mechanism is ("cool concurrent field-level", object-level, or in-memory database binary-log replication), you can actually put Amdahl's law to the test and watch the throughput per node diminish until you reach a dead end. If your solution is based on replication or master/slave, you'll probably feel it by the 2nd and 3rd nodes...
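Amdahl's law makes this concrete. If a fraction p of the work parallelizes and the remaining (1 - p) is serialized on the shared or replicated resource, the speedup on n nodes is 1 / ((1 - p) + p / n). A short sketch (the 90% figure is an assumption for illustration):

```java
// Amdahl's law: speedup on n nodes when only a fraction of the work
// parallelizes, the rest being serialized on a shared resource.
public class AmdahlSketch {

    static double speedup(double parallelFraction, int nodes) {
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / nodes);
    }

    public static void main(String[] args) {
        double p = 0.9; // assumed: 10% of each operation hits the shared/replicated resource
        for (int n = 1; n <= 4; n++) {
            System.out.printf("%d nodes -> %.2fx%n", n, speedup(p, n));
        }
    }
}
```

Even with 90% of the work parallel, 4 nodes yield only about a 3.1x speedup, and the per-node gain keeps shrinking from there, which is the "feel it on the 2nd and 3rd nodes" effect.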
A short say about cost
Costs are measured by many aspects, such as learning curve, time to market, hidden costs, TCO, ROI, ROA, etc.; there is a lot of public information about them. What I want to emphasize is future-proofing, existing skills, and open source.
Application servers, as runtime platforms, can no longer rely on hardware or database clustering; it is their job to take care of HA and scaling and to meet the requirements leveraging existing hardware. Real-time HA should not be an extra cost. IT is moving towards elasticity (the cloud) based on pure cost savings, and that environment requires different runtime capabilities in order to leverage the cloud; the main factor is being agnostic to differently (on-the-fly) provisioned hardware (IP, disk, etc.).
A last word on open source, which I use on a daily basis. Open source has a dramatic impact on future standards through the bottom-up approach, so when I choose an app server I want to verify that it supports open-source standards such as Spring. BUT as this market grows and commercial entities start to take part in it, you'll soon find out they are using the "drug dealer approach": giving you something for free (sometimes a very essential part of your system!), and then, when it comes to future requirements or moving to production, you discover that HA and other features are not part of the community package. Trying to find out the costs, you might discover that there isn't any list price available….