What Are DATAllegro Grid-Enabled Data Warehouse Appliances?
Each DATAllegro appliance is a massively parallel processing (MPP) system consisting of nodes that, when used together, provide an extremely high-speed solution for very large databases. However, the nodes within a DATAllegro appliance are in fact self-contained database servers running Ingres on SUSE Linux. Therefore, a DATAllegro appliance can be viewed as a highly specialized grid of servers pulled together to collectively form a data warehouse appliance.
A DATAllegro grid extends the MPP concept to a collection of DATAllegro appliances that work together, providing the convenience of centralized control while retaining the autonomy and cost benefits of a distributed system. This architecture also scales vertically and horizontally to meet the ever-growing needs of the business without impacting the performance of the centralized hub.
The grid capitalizes on DATAllegro’s existing open-source, industry-standard architecture. Under the grid architecture, data is not only moved or loaded within a single appliance; it also moves directly between nodes in different appliances, which maximizes parallelism and overall transfer speed. Such a grid can move data between appliances at over a terabyte per minute, depending on the number of nodes in each appliance.
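To see why the transfer rate depends on node count, consider a back-of-the-envelope model. It is only a sketch: it assumes every node-to-node link sustains the same rate and that the appliance with fewer nodes limits how many links run in parallel. The function names and the 30 GB/min per-link figure are hypothetical, not DATAllegro specifications.

```python
def aggregate_rate_gb_per_min(source_nodes, target_nodes, per_link_gb_per_min):
    """Estimate the combined transfer rate between two appliances.

    Assumption: each source node streams to one target node, so the
    appliance with fewer nodes caps the number of parallel links.
    """
    if source_nodes <= 0 or target_nodes <= 0:
        raise ValueError("both appliances need at least one node")
    parallel_links = min(source_nodes, target_nodes)
    return parallel_links * per_link_gb_per_min


def minutes_to_move(data_gb, source_nodes, target_nodes, per_link_gb_per_min):
    """Estimate how long moving data_gb takes under the model above."""
    rate = aggregate_rate_gb_per_min(source_nodes, target_nodes, per_link_gb_per_min)
    return data_gb / rate


# Hypothetical figures: 40-node hub, 20-node data mart, 30 GB/min per link.
print(aggregate_rate_gb_per_min(40, 20, 30.0))  # 600.0
print(minutes_to_move(1200, 40, 20, 30.0))      # 2.0
```

Under these assumptions, doubling the node count of the smaller appliance doubles the estimated rate, which is the scaling behavior the paragraph above describes.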
The DATAllegro grid architecture maintains all metadata associated with the grid (such as database object definitions and physical topology) and automatically keeps data synchronized across all attached appliances in an easy-to-administer fashion.
A Closer Look at the DATAllegro Grid Architecture
Hub-and-Spoke
A grid of appliances can be used as the basis for any large-scale data warehouse. However, it is particularly suitable for a hub-and-spoke architecture.
A fairly large appliance acts as the hub of a set of data mart (DM) appliances. The hub holds detailed data, probably in a normalized schema, for a number of business units or perhaps the entire enterprise. The hub can be loaded in near real-time or in daily batches. Using a star schema as an example, ETL tools such as Informatica or standard SQL scripts use the detailed data to create fact tables for any number of business units. The fact tables can then be transferred to the appropriate data mart(s) via the grid at very high speed. Shared (conformed) dimension tables are also maintained on the hub and are easily and quickly pushed out to the spokes as required.
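The distribution step described above can be sketched as a simple transfer plan. This is an illustration only: the function, the table names, and the data mart names are hypothetical, and the sketch assumes every spoke receives every conformed dimension table, whereas a real deployment would push them only where required.

```python
def plan_transfers(fact_routes, conformed_dimensions, spokes):
    """Return (table, spoke) pairs describing what the hub pushes where.

    fact_routes maps each fact table to the data marts that need it.
    Assumption: every spoke receives every conformed dimension table.
    """
    transfers = []
    for table, targets in fact_routes.items():
        for spoke in targets:
            if spoke not in spokes:
                raise ValueError(f"unknown data mart: {spoke}")
            transfers.append((table, spoke))
    for dimension in conformed_dimensions:
        for spoke in spokes:
            transfers.append((dimension, spoke))
    return transfers


# Hypothetical hub with two data marts.
plan = plan_transfers(
    fact_routes={"sales_fact": ["finance_dm", "marketing_dm"],
                 "campaign_fact": ["marketing_dm"]},
    conformed_dimensions=["date_dim"],
    spokes=["finance_dm", "marketing_dm"],
)
for table, spoke in plan:
    print(f"{table} -> {spoke}")
```

Keeping the routing on the hub means each data mart receives only the fact tables its business unit needs, while shared dimensions stay consistent across spokes.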
Users connect to the independent DM appliances as usual for running queries. This allows each DM to be tuned for the needs of a particular set of users and sized to handle the required level of performance and concurrency.
Multi-Temperature
The DATAllegro grid provides a multi-temperature system that is easy to manage while balancing performance and cost across the various periods for which data must be stored. For example, assume that the data warehouse as a whole must store seven years of historical data for compliance purposes, but that the most recent quarter and year are accessed far more frequently than older data. The most recent quarter can be placed on a very high-performance appliance, perhaps with enough RAM to cache the entire date range. Data that is three to 12 months old can be stored on a standard DATAllegro appliance with very good performance, and data older than one year can be stored on one of DATAllegro’s online archive appliances, which offer up to 200TB of user data storage per rack at less than $8k per terabyte.
Each of the three separate appliances can be sized according to the performance and amount of storage required. As fresh data is loaded, the data that needs to be moved between the appliances is automatically moved across the grid. Incoming queries are automatically broken down into the relevant date ranges and the responses from the various queries collated into a single result set before being sent back to the user.
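The query breakdown described above can be illustrated with a small sketch. It is not DATAllegro's implementation: the tier names, the age boundaries in days (roughly a quarter, a year, and seven years), and both functions are assumptions made for the example.

```python
from datetime import date, timedelta
from itertools import chain

# Hypothetical tiers: (name, minimum age in days, maximum age in days, exclusive).
TIERS = [("hot", 0, 91), ("warm", 91, 365), ("archive", 365, 7 * 365)]


def split_by_tier(start, end, today):
    """Split the inclusive date range [start, end] into per-tier sub-ranges."""
    parts = []
    for name, min_age, max_age in TIERS:
        newest = today - timedelta(days=min_age)      # youngest date held by the tier
        oldest = today - timedelta(days=max_age - 1)  # oldest date held by the tier
        sub_start, sub_end = max(start, oldest), min(end, newest)
        if sub_start <= sub_end:  # the query overlaps this tier
            parts.append((name, sub_start, sub_end))
    return parts


def collate(per_tier_rows):
    """Merge the row lists returned by each tier into one result set, sorted by date."""
    return sorted(chain.from_iterable(per_tier_rows), key=lambda row: row[0])


today = date(2024, 1, 1)
for tier, sub_start, sub_end in split_by_tier(date(2023, 6, 1), date(2023, 12, 31), today):
    print(tier, sub_start, sub_end)
```

In this sketch a query spanning June through December touches only the hot and warm tiers, so the archive appliance is never consulted; `collate` then stitches the partial results back into a single ordered result set.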
Disaster Recovery
The power of the DATAllegro grid concept extends across multiple data centers to provide a highly effective disaster recovery (DR) capability. Individual appliances can be replicated at a second site and automatically kept up to date by node-to-node replication. Data compression is used to reduce the required bandwidth between the two sites. Note that not all of the appliances on a grid need to be replicated. Hence, each business unit can decide whether to provide a DR capability, based on its own service-level agreements (SLAs).
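A minimal sketch of the two ideas in this section, selective replication and compression before transfer, follows. The source does not say which compression scheme DATAllegro uses; Python's standard `zlib` module stands in purely for illustration, and the appliance names, row format, and function are hypothetical.

```python
import zlib


def replication_batches(appliances, pending_rows):
    """Yield (appliance, compressed_payload) for appliances with DR enabled.

    appliances maps an appliance name to True if its business unit opted in to DR.
    pending_rows maps an appliance name to the rows not yet sent to the second site.
    """
    for name, dr_enabled in appliances.items():
        if not dr_enabled:
            continue  # this business unit chose not to replicate
        raw = "\n".join(pending_rows.get(name, [])).encode("utf-8")
        yield name, zlib.compress(raw)


# Hypothetical grid: only the finance data mart is replicated.
appliances = {"finance_dm": True, "marketing_dm": False}
pending = {"finance_dm": ["2024-01-01,store_1,100"] * 1000,
           "marketing_dm": ["2024-01-01,campaign_7,5"]}

for name, payload in replication_batches(appliances, pending):
    raw_size = len("\n".join(pending[name]).encode("utf-8"))
    print(f"{name}: {raw_size} bytes raw, {len(payload)} bytes compressed")
```

How much bandwidth compression saves depends entirely on the data; the highly repetitive rows used here compress far better than typical warehouse data would.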