STATISTICAL PROGRAM PACKAGES AND SOFTWARE +----------------+
| C.C. Note # 23 |
+----------------+
| gen : stat |
+----------------+
07-Aug-1992
An introduction to computing in statistics.
Emphasis is given to the use of statistical
program packages, in particular those which
are available on the Silicon Graphics UNIX
system at Auckland University.
Keywords: statistics, packages, sas, spss, genstat, minitab,
dataset
WHAT IS A STATISTICAL PACKAGE?
A statistical package is a computer program which reads in
information, performs mathematical calculations associated with
standard statistical procedures, and prints the results, organized
and formatted. Many such packages have been produced, but this
Note dwells only on those available on the Auckland University
Silicon Graphics system.
REQUIREMENTS FOR USING A STATISTICAL PACKAGE
Certain things are necessary before you can make effective use of
computer software for statistical calculations. These include
prerequisites for any work in statistics, with or without a
computer:
- You should have a specific question to answer (more precisely,
an hypothesis to test), and data from experiments or surveys which
can be analysed to give the answer.
- You should know what statistical techniques are needed to get
the answers from the data.
- You should understand the statistical method or analysis you
wish to perform.
The Computer Centre cannot offer much help on these topics.
Other things are peculiar to work with computers:
- You must have a program to perform the necessary calculations
which will run on an available computer.
- The information must be in machine-readable form, normally as a
file on disk. It should be organized in a systematic way.
We are more likely to be able to help you here. This Note
introduces you to programs available on our Silicon Graphics for
anyone to use. Since you can use these, you don't have to spend
time writing your own programs. We may also be able to offer help
with your data: if it is at present written on paper, you may
find the Computer Centre's Data Capture service useful to get it
into the computer (see CC Note # 119). If you are being really
clever, and are reading this before you collect the data, you
might also be interested in CC Note # 118, which offers hints on
designing forms for data collection.
GENERAL NOTES ON STATISTICAL PACKAGES
It is useful to introduce some terms which commonly occur in the
context of statistical processing using a computer:
VARIABLE - a set of values of a particular attribute of the object
under scrutiny. Sex, age and weight might be different variables
in some dataset.
OBSERVATION (or CASE) - the set of data values that describe one
individual. The surname, sex, age and weight of one experimental
subject might comprise an observation.
DATASET - the whole collection of data to be analysed.
RECORD - a line entered at a terminal, or a "line" (logical record)
on disk or tape. (An observation may possibly occupy more than
one record).
The first three of these terms describe the data itself, while the
last is concerned with the physical representation of the data.
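The relationship between these four terms can be sketched in a few
lines of Python. This is only an illustration: the names, the record
layout and the data values are all invented, and none of the packages
described in this Note uses this syntax.

```python
# Hypothetical illustration of the four terms; all names and values are invented.

# Each RECORD is one physical line of text, as it might sit in a disk file.
records = [
    "SMITH  F 34 61.5",
    "JONES  M 41 80.2",
    "BROWN  F 29 58.0",
]

# The names of the VARIABLES, in the order they appear in each record.
variable_names = ["surname", "sex", "age", "weight"]


def to_observation(record):
    """Turn one record into one OBSERVATION (a mapping of variable name to value)."""
    surname, sex, age, weight = record.split()
    values = [surname, sex, int(age), float(weight)]
    return dict(zip(variable_names, values))


# The DATASET is the whole collection of observations.
dataset = [to_observation(r) for r in records]

# One variable is the set of values of a single attribute across the dataset.
ages = [obs["age"] for obs in dataset]
print(ages)  # [34, 41, 29]
```

Here each observation happens to occupy exactly one record; a longer
observation could equally be spread over two or more records.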
All of the variables for an observation are contained in one or
more records of some dataset. Each piece of information (the
value of each variable) must be given in an appropriate form, or
marked as "missing data" if the package so permits. It may be
useful to encode the data into some convenient form: if you wish
to analyse a questionnaire, for example, answers could be given
alphabetic codes such as Y=yes, N=no, or numeric codes, such as
1=agree, 2=disagree. If the information is to be used in a
mathematical formula (such as in calculating a regression
coefficient), the information must be coded numerically. Several
packages, including SAS, can handle alphabetic information for
non-mathematical uses, but some will not accept it at all.
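As a rough sketch of such coding, the Python fragment below turns
questionnaire answers into the numeric codes 1=agree, 2=disagree and
marks anything else as missing. The codebook name, the sample answers
and the use of None as the missing marker are assumptions made for
the example; each package has its own way of declaring these things.

```python
# Hypothetical codebook; the names and sample answers are invented.
AGREE_CODES = {"agree": 1, "disagree": 2}
MISSING = None  # assumed stand-in for a package's "missing data" marker


def encode(answer):
    """Map a raw questionnaire answer to its numeric code, or to MISSING."""
    return AGREE_CODES.get(answer.strip().lower(), MISSING)


raw_answers = ["Agree", "disagree", "", "AGREE", "don't know"]
coded = [encode(a) for a in raw_answers]
print(coded)  # [1, 2, None, 1, None]

# Only the non-missing values take part in a calculation.
valid = [c for c in coded if c is not MISSING]
proportion_agree = valid.count(1) / len(valid)
print(proportion_agree)
```

Keeping the missing answers out of the denominator is a choice made
here for illustration; whether that is appropriate depends on the
analysis.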
Before the package can do anything for you, you have to give it
instructions. Using the language of the package, therefore, you
provide the following information:
- Where to find the data: "at the end of the program", or "in a
disk file called ....".
- The structure of the observations in the data: what variables
are present, and how they are arranged in the records.
- Which statistical procedure(s) are to be performed and which
variables are to be used with each.
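The three pieces of information above can be sketched in Python as
follows. The column layout, the variable names and the data are
invented for the example, and an in-memory string stands in for the
disk file; a real package would express the same things in its own
control language.

```python
import io
import statistics

# 1. Where to find the data: an in-memory stand-in for a disk file (invented data).
data_file = io.StringIO(
    "SMITH   F 34  61.5\n"
    "JONES   M 41  80.2\n"
    "BROWN   F 29  58.0\n"
)

# 2. The structure of the observations: an assumed fixed-column layout.
#    Each entry is (variable name, first column, last column, type),
#    with columns counted from 1 and both ends included.
LAYOUT = [
    ("surname", 1, 8, str),
    ("sex", 9, 9, str),
    ("age", 11, 12, int),
    ("weight", 14, 18, float),
]


def read_observations(lines, layout):
    """Cut each record into variables according to the column layout."""
    observations = []
    for line in lines:
        line = line.rstrip("\n")
        obs = {}
        for name, first, last, kind in layout:
            obs[name] = kind(line[first - 1:last].strip())
        observations.append(obs)
    return observations


observations = read_observations(data_file, LAYOUT)

# 3. Which procedure to perform, and on which variable: the mean of "age".
mean_age = statistics.mean([obs["age"] for obs in observations])
print(mean_age)
```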
Most statistical packages will let you do the following if you
wish:
- Specify a label for each of the variables to be used on the
printout.
- Specify a label for each of the values of a variable which will
be displayed on the printout.
- Specify values which are to be treated as "missing data" and not
included in the statistical computations.
- Transform the values of input variables.
- Create new variables as functions of variables read in.
- Save all of the above information about variables and data in a
special file on disk. Notice that this is not the same as the
dataset itself.
This is information about the dataset, so it could be - and
frequently is - used with several datasets in turn. A saved file
saves time and money, since all of the information needed is
already present and you need only specify the name of the saved
file to be read, the statistical procedure(s) to be performed, and
which saved variables are involved.
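The idea of keeping the description of the variables apart from the
data can be sketched in Python. Everything here is hypothetical: the
variable names, the labels, the missing-value code 9, and the use of
JSON text to stand in for a package's saved file. The point is only
that one saved description is read back and applied to two different
datasets.

```python
import json

# Hypothetical description of the variables, kept separate from any data.
description = {
    "labels": {"q1": "Supports the proposal"},
    "value_labels": {"q1": {"1": "agree", "2": "disagree"}},
    "missing": {"q1": [9]},
}

saved = json.dumps(description)   # the text that would be written to the saved file
restored = json.loads(saved)      # what a later run would read back


def tabulate(dataset, variable, desc):
    """Count the labelled values of one variable, skipping declared missing codes."""
    missing = set(desc["missing"].get(variable, []))
    labels = desc["value_labels"].get(variable, {})
    counts = {}
    for obs in dataset:
        value = obs[variable]
        if value in missing:
            continue
        label = labels.get(str(value), str(value))
        counts[label] = counts.get(label, 0) + 1
    return counts


# Two invented datasets sharing the same variable description.
survey_a = [{"q1": 1}, {"q1": 2}, {"q1": 9}, {"q1": 1}]
survey_b = [{"q1": 2}, {"q1": 2}, {"q1": 1}]

counts_a = tabulate(survey_a, "q1", restored)
counts_b = tabulate(survey_b, "q1", restored)
print(restored["labels"]["q1"], counts_a, counts_b)
```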
The instructions must be prepared in the form peculiar to the
particular package. In order to use a package, you have to learn
its language. One major difference between packages lies in their
syntax - the spelling of keywords, the punctuation, and the order
of the instructions. The most frequent errors in using a package
are syntactic. Careful spelling and punctuation are required:
close study of the appropriate manual will be amply rewarded.
CHOOSING A STATISTICAL PACKAGE
All statistical packages are similar in that they require a basic
familiarity with statistics and relatively little knowledge of
conventional programming languages. The user enters "control" or
"parameter" statements in a relatively "natural" or
"English-like" language.
In selecting a statistical package, you should consider two
things:
Most important - does the package contain the appropriate
statistical procedures, and will it compute and report all of the
statistics required? All the large packages will do descriptive
statistics, such as means, standard deviations, frequency
distributions, cross tabulation, chi-squares, etc. For more
advanced and/or specialized statistical procedures, check the
manuals of appropriate packages to make sure that the analyses,
options and results required are all available. This time will be
well spent and will save untold frustration and delay later.
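To make the descriptive statistics named above concrete, the Python
sketch below computes a mean, a standard deviation, a frequency
distribution, a cross tabulation and a Pearson chi-square (without
continuity correction) by hand. The data are invented, and a package
would of course report far more than this.

```python
import statistics
from collections import Counter

# Invented data: one (sex, answer) pair and one age per respondent.
answers = [("F", "yes"), ("F", "yes"), ("F", "no"), ("M", "yes"),
           ("M", "no"), ("M", "no"), ("F", "yes"), ("M", "no")]
ages = [34, 41, 29, 52, 38, 45, 31, 50]

# Mean and (sample) standard deviation.
mean_age = statistics.mean(ages)
sd_age = statistics.stdev(ages)

# Frequency distribution of one variable, and cross tabulation of two.
frequency = Counter(answer for _, answer in answers)
crosstab = Counter(answers)

# Pearson chi-square for the cross tabulation.
rows = sorted({sex for sex, _ in answers})
cols = sorted({answer for _, answer in answers})
n = len(answers)
row_totals = {r: sum(crosstab[(r, c)] for c in cols) for r in rows}
col_totals = {c: sum(crosstab[(r, c)] for r in rows) for c in cols}

chi_square = 0.0
for r in rows:
    for c in cols:
        expected = row_totals[r] * col_totals[c] / n
        chi_square += (crosstab[(r, c)] - expected) ** 2 / expected

print(mean_age, sd_age, dict(frequency), chi_square)
```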
Next - how familiar are you with computing? If you are a newcomer
to statistical packages, look for a manual which is easy to
understand and tutorial in character. (You may not find one, but
the SAS and SPSS texts aren't too bad). Statistical package
manuals are not a good source of statistics training, however, so
a statistics textbook may be a necessity.
Owing to limited staff resources, the Computer Centre
can only support one statistical package, and that is
SAS. The other packages mentioned below are available
to those who wish to use them, but little help can be
provided by the Computer Centre.
SAS (Silicon Graphics; also available in MS-DOS version)
SAS (which stands for Statistical Analysis System) is an
integrated package notable for its wide range of applications,
from the simplest descriptive statistics to very complicated
general linear modelling.
SAS is fairly easy for the beginner to learn. The Introductory
manual is tutorial in character, leading the new user from the
initial organization of data through descriptive statistics,
analysis of variance, regression, and report writing.
Some notable features include an adequate plotting program, a
procedure to produce bar charts, pie charts, etc., schematic
plots, and formatted, titled reports. SAS also contains a
relatively extensive programming language that allows for
do-loops, arrays, macros, subroutines, selectable input formats,
and arbitrary reports. It is the only package that can interpret
data recorded in non-standard formats such as dollars and cents
formats, date-time formats, and (USA) social security numbers as
well as packed and zoned decimal, integer and floating point
binary. Input/output is efficient enough to handle very large
datasets (300,000+ observations). (Finding enough disk space to
put them on is another matter entirely!).
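For readers who have not met such formats, the Python sketch below
shows what decoding two of them involves. It is not SAS code and says
nothing about how SAS itself works. It assumes the usual
packed-decimal convention (two decimal digits per byte, with the sign
in the final half-byte, hexadecimal D meaning negative), and the
function names are invented.

```python
from decimal import Decimal


def unpack_packed_decimal(raw, scale=0):
    """Decode packed-decimal bytes; `scale` is the number of implied decimal places.

    Assumption: a final sign nibble of 0xD means negative and any other
    value is treated as positive.
    """
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)      # high half-byte holds one digit
        digits.append(byte & 0x0F)    # low half-byte holds the next digit
    digits.append(raw[-1] >> 4)       # last byte: one digit, then the sign
    sign_nibble = raw[-1] & 0x0F
    value = int("".join(str(d) for d in digits))
    if sign_nibble == 0x0D:
        value = -value
    return Decimal(value).scaleb(-scale)


def parse_dollars(text):
    """Read a dollars-and-cents string such as "$1,234.50"."""
    return Decimal(text.replace("$", "").replace(",", ""))


print(unpack_packed_decimal(bytes.fromhex("12345C"), scale=2))  # 123.45
print(unpack_packed_decimal(bytes.fromhex("01234D"), scale=2))  # -12.34
print(parse_dollars("$1,234.50"))                               # 1234.50
```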
SAS can be used interactively, in which case things happen as you
enter instructions at the terminal and you control the sequence of
operations step by step. Alternatively, you may first write a
complete program in the SAS language and then leave SAS to perform
the whole lot. The language is essentially the same either way.
SAS also has a full-screen editor procedure for changing SAS
datasets. It is the most powerful of the packages for merging and
amending datasets. Labelling code values and recoding variables
can, however, be more awkward in SAS than in some other packages.
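Merging, in this context, means combining observations from two
datasets that share an identifying variable. The Python sketch below
illustrates the idea only; it is not SAS syntax, and the dataset
names, the key variable "id" and the values are invented.

```python
# Two invented datasets that share the key variable "id".
demographics = [{"id": 1, "sex": "F"}, {"id": 2, "sex": "M"}, {"id": 3, "sex": "F"}]
weights = [{"id": 1, "weight": 61.5}, {"id": 3, "weight": 58.0}, {"id": 4, "weight": 70.1}]


def merge_by_key(left, right, key):
    """Combine observations with the same key; keep those found in either dataset."""
    merged = {}
    for obs in left + right:
        merged.setdefault(obs[key], {}).update(obs)
    return [merged[k] for k in sorted(merged)]


combined = merge_by_key(demographics, weights, "id")
for obs in combined:
    print(obs)
```

Observations present in only one dataset (ids 2 and 4 here) are kept
with the other variables absent, which a package would normally
report as missing data.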
The SAS User's Guide does not attempt to teach you statistics as
the SPSS manual does. It assumes that you already have enough
knowledge to choose your method of analysis.
SPSS (Silicon Graphics only)
SPSS (which stands for Statistical Package for the Social
Sciences) is an integrated package available on a variety of
machines. It is designed for the beginner and has a large,
comprehensive manual containing explanations of the various
statistical procedures as well as many examples of the control
statements and procedures. The package is fairly efficient and
can handle large amounts of data. SPSS can read data in
non-standard formats such as binary, packed or zoned decimal, and
floating point.
SPSS-X has a wide range of statistical procedures and a large
number of non-parametric tests not generally found in other
packages. Multiple response type variables, often found on
questionnaires, are also handled by SPSS-X. Nevertheless, other
things being equal, SPSS-X is not recommended above SAS.
GENSTAT (Silicon Graphics only)
GENSTAT is a very powerful and flexible statistical package. It
covers the fields of analysis of designed experiments and
multivariate analysis. In some senses it is a language for
describing how the analysis is to be performed and what tests are
to be performed. It was designed for use by statisticians, and
one needs a detailed knowledge of statistics to make use of the
more sophisticated features. Unfortunately, GENSTAT is not easy
to use for even quite simple analyses. It is not recommended for
novices. However, a macro facility enables an experienced
statistician to write a macro which a less experienced user can
then run to perform an analysis.
MINITAB
This is a simple interactive package mainly used by undergraduate
students.
See also: CC Notes # 78, 92, 93 (SAS) and 74 (SPSS-X).
Acknowledgment: Some of this material originated at Yale
University.
Responsible: Russell Fulton