Wednesday, April 22, 2009

Optimizing Your SAS Performance

Google has become very successful by developing an efficient search engine that runs on commodity hardware. Rather than following the old model of putting all its resources onto one supercomputer, it spreads the processing across a cluster of smaller machines running in parallel to form a grid. In 1965, Gordon Moore made an observation predicting that the number of transistors per square inch on integrated circuits would double every year. This trend became known as Moore's law, and it continues to raise the power of ubiquitous and relatively inexpensive desktop and laptop computers. This paper discusses how you can cluster computers in a grid to optimize the execution of SAS programs. The techniques discussed include:

  • Implementing supercomputer power with commodity hardware

  • Submitting SAS programs sequentially while maintaining inter-program dependencies

  • Threading multiple groups of programs for optimal performance

  • Measuring SAS performance with Statmark, a standard metric for cross-platform benchmarking of SAS processing

  • Scheduling the execution of programs in a grid environment

In the world of Moore's law, it makes less sense to make a large capital investment in a single server. By clustering inexpensive smaller machines in a grid and dynamically adding new computers to that architecture, you can scale your SAS computing resources the way Google scaled its search engine.

Introduction
In analytics, as statistical models become more sophisticated and datasets grow larger, computing resources are the much-needed engine that delivers results. SAS has evolved along with hardware systems to use the horsepower needed to crunch statistical models and data manipulations. When I first started working with SAS, it ran on a mainframe computer system under TSO. That environment was centrally controlled, with very limited user customization from a dumb terminal. As computer chips got smaller, SAS processing started to move toward smaller UNIX servers. Then the introduction of SAS on personal computers dramatically changed how most users performed their data exploration. Users tested their data models and reports on their PCs, although they still ran production jobs on a networked server. This evolution continues as the desktop becomes more powerful. As the technologies used to connect them mature, desktop PCs are beginning to form computing grids that can outperform traditional servers. The driving forces include the shrinking size and cost of computer chips, even as their performance increases, coupled with the falling cost of memory and storage. Together, these elements supply analytical tools such as SAS with a greater abundance of computing resources. We are at a juncture in this evolution where how computing resources are used can matter more than simply obtaining them.

IT managers need to evaluate the cost of a server over its lifetime, since the performance it delivers for its price diminishes over time. It is similar to purchasing a car: the car does not get any faster, but its value constantly goes down. Computing resources have an even lower return on investment because they become obsolete very quickly; the next model is usually cheaper, yet outperforms the current one. It is therefore not always prudent to commit a large capital expenditure to a piece of hardware whose performance-to-price ratio will fall behind in such a short span of time. Grid computing offers a different model, in which commodity hardware can be expensed at a lower cost. There is also greater flexibility: the grid can scale to match the needs of a growing group without necessarily throwing out the old server to replace it with a new one. Nodes can be added, and older nodes can be taken off, like a living organism shedding dead skin. In the grid, each newer node has the advantage of providing the fastest computing power available for the cost at the time it is added.

This spreading out of capital expenses on computing resources is analogous to the time-valued benefit of investing small amounts over your lifetime to form a balanced portfolio instead of putting one big lump sum into a single stock. It acts as a buffer against the ups and downs of the market. In this case, the market is not the financial market but the market for computing hardware. As hardware continues to get cheaper for a given level of performance, software seems to get more expensive as its complexity increases. Licensing SAS is not cheap, so it is wise to optimize the hardware that SAS runs on, since over time the hardware cost will be a fraction of the software cost.
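
The trade-off between one large purchase and staged purchases can be illustrated with some rough arithmetic. The sketch below is only an illustration under stated assumptions: the budget figures are hypothetical, performance per dollar is assumed to double on a fixed schedule, and retirement of old nodes, power, and administration costs are ignored. The function names are made up for this example.

```python
def perf_per_dollar(year, doubling_years):
    """Relative performance bought per dollar in a given year (year 0 = 1.0).

    Assumes performance per dollar doubles every `doubling_years` years.
    """
    return 2.0 ** (year / doubling_years)


def capacity_by_year(purchases, horizon, doubling_years):
    """Installed capacity at the end of each year from 0 to horizon - 1.

    `purchases` maps a year to the dollars spent in that year.
    Capacity is in arbitrary units: dollars times relative performance per dollar.
    """
    capacity = []
    total = 0.0
    for year in range(horizon):
        total += purchases.get(year, 0.0) * perf_per_dollar(year, doubling_years)
        capacity.append(total)
    return capacity


# Hypothetical budget of 40,000 spent either at once or in four equal parts,
# with performance per dollar assumed to double every year.
single_server = capacity_by_year({0: 40000}, horizon=4, doubling_years=1.0)
staged_grid = capacity_by_year(
    {0: 10000, 1: 10000, 2: 10000, 3: 10000}, horizon=4, doubling_years=1.0
)

print(single_server)  # [40000.0, 40000.0, 40000.0, 40000.0]
print(staged_grid)    # [10000.0, 30000.0, 70000.0, 150000.0]
```

Under these assumptions the staged grid starts with less capacity than the single server but overtakes it by the third purchase. A slower doubling schedule narrows the gap, so the numbers matter less than the shape of the comparison.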

One of the key components in optimizing computing cost is the ability to measure the performance of your system with precision. This metric can help you evaluate the return on your investment. Going without any form of measurement is like shopping for a credit card without being able to find out the APR. This paper will premiere a free utility from MXI called Statmark, which gives you the tools to make the right decisions about hardware implementations.

SAS Institute has had the technology to run jobs on remote machines for many years with SAS/Connect. It uses protocols such as TCP/IP to connect to a remote machine and run your program there. SAS Grid computing leverages this, along with other software such as the Grid Manager, to optimize the performance of SAS across multiple nodes and make the best use of the computing resources within a grid.
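
The idea of running groups of programs in parallel while respecting inter-program dependencies can be sketched in a few lines of Python. This is a minimal illustration and not how SAS/Connect or the Grid Manager works internally. The function `run_in_dependency_order`, the program names, and the `run_program` callable are all hypothetical; in practice `run_program` would be whatever launches a SAS job on a node.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # Python 3.9+


def run_in_dependency_order(dependencies, run_program, max_workers=4):
    """Run programs in parallel, starting each only after its prerequisites finish.

    `dependencies` maps a program name to the set of programs it depends on.
    `run_program` is a stand-in callable that would launch the job.
    Returns the program names in the order they completed.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the dependencies are circular
    finished = []
    running = {}  # future -> program name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every program whose prerequisites are all done.
            for program in sorter.get_ready():
                running[pool.submit(run_program, program)] = program
            # Block until at least one running program completes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                program = running.pop(future)
                future.result()  # re-raise any error from the job
                finished.append(program)
                sorter.done(program)  # unblocks programs that depend on it
    return finished


# Hypothetical job stream: two models can run in parallel once cleaning is done.
deps = {
    "extract.sas": set(),
    "clean.sas": {"extract.sas"},
    "model_a.sas": {"clean.sas"},
    "model_b.sas": {"clean.sas"},
    "report.sas": {"model_a.sas", "model_b.sas"},
}
order = run_in_dependency_order(deps, run_program=lambda name: None)
```

Here `model_a.sas` and `model_b.sas` may finish in either order, but `report.sas` never starts until both are done. A worker pool on one machine stands in for the grid nodes in this sketch.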

More can be found in the paper, available at Sy Truong papers, SAS Performance Tuning, and SAS Grid Computing.