BAKK.COM



Cost effective data warehouse

The need for implementation of data warehouse is common – it concerns large corporations as well as small and medium enterprises. It stems from the fact that data warehouse constitutes a strategy of collecting and analyzing cross-sectional data pertaining to numerous aspects of organization activity linked together.

BAKK presents the manner of implementing a data warehouse with low costs and the utmost focus on return on investment period.

It would be difficult to describe the benefits of data warehouse without referring to a broader business context and solutions available on the market – therefore, we present a business justification for data warehouse profitability as well as alternative architectures to be applied.


Implementation profitability

On the point of making a decision pertaining to whatever investment – also about data warehouse implementation – every decision-maker considers the added value, the profit that the company will gain owing to the investment. The IT system value is a balance of all costs and benefits from an undertaking assuming key realizations of its functionality.
Specialists in data warehouse implementation choose analytical areas to be carried out in such a way so as to make the return on investment period as short as possible. Data warehouse is no longer a source of knowledge for strategic decision making with a two or three-year perspective, sponsors want to see the results straightaway,
New data warehouse implementations are focused on creating the analyses which support the operational activity and bring direct income. The most popular among them are: clients’ profitability analysis, clients’ purchase analysis, supply and storehouse stock list analysis or clearing analysis.
A skillful tradesman backed by a report on purchase dynamics within product groups will distinguish among company’s clients those who should be served better in order to increase turnover and thus his own commission. An area sales manager will take interest in the profitability of presently attended clients. Logistics and financial departments will aim at minimizing the amounts put into sales or production processes, finally a vindication department will be keen on choosing potential targets.
Another thematic field which companies willingly implement is the analysis aimed at giving financial and organizational discipline a boost. Controlling and accounting departments are in numerous analytical needs, which data warehouses are able to satisfy. A vast array of reports in the field of administrative accounting: indicatory analysis or reporting for superior units generated on the basis of a common source are of a much better quality and save employees’ time.
Analyses supporting marketing and human resources departments, which bring benefits in a longer course of time are slightly out of favor. Once popular association analyses of purchase, client segmentation and human resources analyses are usually postponed until “later data warehouse development”.

Factors which influence data warehouse price.

As in case of every IT project, there are three factors which have an influence on the price: personnel costs as well as equipment and software costs. Equipment and implementation costs optimization is a matter for a separate article, here I would like to concentrate on the aspects of minimizing software expenditure.
A minimal set of software necessary for data warehouse construction consists of:
  • Extract, transform, load (ETL) device,
  • Database,
  • Device for presenting and analyzing data.

License cost of an ETL device depends on a number of factors: producer’s license policy, amount and types of data sources, number of developer positions; therefore, it’s difficult to indicate any prices. One may, however, assume, with a certain safety margin, that the price range of ETL package licenses alone varies from $3,000 to $120,000, when delivered by leading suppliers.
The situation is similar as regards reporting devices, in which case the most common price-influencing factors are the number of seat licenses or for named users as well as the functional range of a device. The price range of licenses offered by leading suppliers begins with about $3,000, however, very often the price may reach as much as hundreds of thousands.
This is the reason for growing demand for the Microsoft solution, whereby, once one buys a database, ETL and reporting components are free.
Microsoft equipped the database Microsoft SQL Server 2000 with two major components which allow for a data warehouse construction without purchase of any further licenses. They are the ETL devices: DTS (Data Transformation Services) – now called Integration Services – and Analysis Services, which support the creation of multidimensional analyses on the server’s and client’s side.
A considerably long presence on the market doesn’t correlate with the number of implementations. Presently, owing to an important price pressure, the triad DTS – Microsoft SQL Server – Analysis Services is discovered by an increasing number of clients. The following parts of the article will be aimed at proving the high value and the abundance of functions of Microsoft’s warehouse devices. Their delivery, so to speak, on the occasion of data warehouse purchase lowers the price barrier constituted by license costs for ETL and a reporting device, which has discouraged many companies from implementing a data warehouse.

A simple data warehouse for “a thrifty user”.

I suggest considering the implementation of a data warehouse on the basis of DTS as a data extracting device, Microsoft SQL Server as a base for data storage, Analysis Services as a device for creating multidimensional analyses as well as the common and well-known Microsoft Excel or newer – Reporting Services – used as a reporting device.

Data Transformation Services

Tasks for DTS are defined in packages. Within a package there are tasks for defining transformations which data is to be subject to, target objects as well as indications pertaining to the order in which particular tasks are to be performed. The device enables establishing a link in order to download or upload data with numerous types of sources.
There is a possibility of cooperating not only with Microsoft SQL Sever database but also with Microsoft Access, dBase, Paradox, and through ODBC with every database to which those drivers have been created. Moreover, support is available for communication with Microsoft Excel and text files in both directions. It proves very useful when a warehouse is to communicate with popular financial and accounting software delivered by domestic producers, which doesn’t always work on the basis of a database that ODBC has been created for.
“Task” is a basic work unit performed within DTS package. Tasks available in DTS may be divided into five groups:
  • Designed for transport and transformation of data e.g.: Transform Data Task, Data Driven Query Task and Bulk Insert Task.
  • Programming based, enabling performance of additional operations on data, starting processes within a system or other DTS packages. Among these tasks there are: execute SQL Task, Execute Process Task, Execute Package Task.
  • Designed for transport of files, sending e-mails or handling MSMQ queues: File Transfer Protocol Task, Send Mail Task and Message Queue Task.
  • Tasks which support transferring objects between bases: Microsoft SQL Server e.g. Copy SQL Server Object Task, Transfer Database Task, Transfer Login Task etc.
  • Support for Analysis Services: Analysis Services Processing Task and Data Mining Task.
The last group will be described more extensively in the following part of the article. The order of task performance may be organized by means of workflow. Depending on what result the execution of a previous task brings (success, error or uninfluenced by result) steering will be transferred to an indicated following task.

Microsoft SQL Server database

Relational database stores data mined by means of the DTS device. Typical warehouse architecture involves dividing objects into stagging area objects, detailed data layer objects and aggregated data layer objects.
Stageing area is used for preparing data for transfer into detailed layer. Stagging area objects are used by loading processes with short-term goals e.g. for data clearing. Chart records within this layer may be subject to modification.
Retail data layer consists of structures which preserve data for the needs of data warehouse and constitutes its running basis. Information gathered within reflect the state of data of source system in the course of time. Charts within this layer are versionized by time, which enables presentation of retail data for a particular day. Chart records within this layer aren’t subject to modification.
Aggregated data layer consists of denormalized and/or aggregated data structures with the aim of improving the effectiveness of reporting and analyses. Charts within this layer are versionized by time, which enables reconstruction of the state of current knowledge for a particular day of performed analysis. Depending on functionality and effectiveness needs data within this layer may be subject to modification.

Analysis Services

Analysis Services is a set of components which enable reporting in multidimensional technology – OLAP. The product functions on its own and is not dependent on relational database Microsoft SQL Server, despite common distribution.
Data is extracted from relational structures, modified into a multidimensional model and, aggregated, it is sent to multidimensional cubes. They are extracted, on demand, from Pivot Table Service installed on client’s station and transferred to reporting devices e.g. Cube Browser or Microsoft Excel.
The basis for multidimensional reporting is constituted by a multidimensional cube. Cubes are created in the course of structure construction. In the data warehouse life cycle, once every defined period, data is added on the basic level. Then, the whole cube is modified in order to obtain refreshed aggregated data. It is particularly this purpose that one of the two above mentioned DTS tasks serves: Analysis Services Processing Task.
Analysis Manager is a device which enables editing cubes. It renders possible establishing a link with data sources containing values as well as hierarchies of measures and dimensions. Analysis Services enables connecting with every data source to which there are ODBC drivers available and creating cubes on the basis of data extracted from thereby. They may be, in particular, data sources on platforms other than Windows.
Within cubes there is a possibility of creating calculated measures and virtual cubes as well as restricting users’ access to data. Security mechanisms are integrated with those on Windows domain. By means of Analysis Manager one can create data mining models both on the basis of relational and multidimensional data. There are two algorithms available: clustering and decision trees.
It may happen in almost every company that after the humdrum and stressful process of creating an application, the time comes for its implementation which causes fear on the part of users who have to learn “another software”. By means of Analysis Services it is possible to make the atmosphere more relaxed because users who have less time for learning new gadgets may be handed in reports in the form of Microsoft Excel pivot tables linked with a cube. Thus, users obtain new data within an already familiar environment, which facilitates its use. The author has observed that neglecting user’s preferences pertaining to access to data may easily discourage any report addressees from using them. And what does a company need even the best data warehouse for, if it’s not in use?

Conclusion

The suggested solution covers a part of functionality which a modern Business Intelligence system should ensure. Its implementation is inexpensive and technologically simple. It doesn’t mean, however, that I consider construction of a profit-generating data warehouse to be trivial. By means of the described architecture one avoids numerous difficulties in terms of incompatibility and is enabled to concentrate on the main goal, getting cash out of data.