The ROUSSEEUW datasets

This directory contains the datasets taken from

Robust Regression and Outlier Detection

by Peter J. Rousseeuw and Annick M. Leroy

About the datasets

The datasets are plain, unformatted files with the extension .dat. To get all the datasets at once, copy the file all.dat.
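As a rough illustration of reading one of these files, the sketch below parses text into rows. It assumes, without confirmation from this directory, that each line holds one observation with whitespace-separated columns. The function name parse_dat and the sample values are hypothetical and are not taken from any dataset listed here.

```python
from typing import List, Union

Value = Union[float, str]


def parse_dat(text: str) -> List[List[Value]]:
    """Parse whitespace-separated rows; numeric tokens become floats.

    Assumption: one observation per line, columns separated by spaces or tabs.
    """
    rows: List[List[Value]] = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        row: List[Value] = []
        for tok in tokens:
            try:
                row.append(float(tok))
            except ValueError:
                row.append(tok)  # keep non-numeric labels as strings
        rows.append(row)
    return rows


# Made-up values, for illustration only.
sample = "1 10.0 20.5\n2 11.5 22.0\n"
rows = parse_dat(sample)
```

To apply this to a file, read its contents first, for example with open("some_file.dat") as f: rows = parse_dat(f.read()), where some_file.dat is a placeholder name. Labels that contain spaces, such as multi-word species or state names, would be split into several tokens under this assumption and would need separate handling.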

The data

.page 22, table 1

Pilot-Plant Data Set, from Daniel and Wood (1971)
The response variable corresponds to the acid content determined by titration, and the explanatory variable is the organic acid content determined by extraction and weighing.
20 subjects, 3 variables:
  1. Observation (i)
  2. Extraction (x[i])
  3. Titration (y[i])

.page 26, table 2

Number of International Calls from Belgium, taken from the Belgian Statistical Survey, published by the Ministry of Economy
73 subjects, 2 variables:
  1. Year (x[i])
  2. Number of Calls (y[i], in tens of millions)

.page 27, table 3

Data for the Hertzsprung-Russell Diagram of the Star Cluster CYG OB1, from C. Doom
47 subjects, 3 variables:
  1. Index of Star (i)
  2. Logarithm of the effective temperature at the surface of the star (x[i])
  3. Logarithm of the light intensity of the star (y[i])

.page 47, table 4

First Word - Gesell Adaptive Score Data, from Mickey et al. (1967)
21 subjects, 3 variables:
  1. Child (i)
  2. Age in Months (x[i])
  3. Gesell Score (y[i])

.page 57, table 7

Body and Brain Weight for 28 Animals, from Weisberg (1980) and Jerison (1973)
28 subjects, 4 variables:
  1. Index (i)
  2. Species
  3. Body Weight (x[i], in kilograms)
  4. Brain Weight (y[i], in grams)

.page 62, table 10

Data on the Calibration of an Instrument that Measures Lactic Acid Concentration in Blood, from Afifi and Azen (1979)
20 subjects, 3 variables:
  1. Index (i)
  2. True Concentration (x[i])
  3. Instrument (y[i])

.page 73, table 13

Pension Funds for 18 Professional Branches, from de Wit (1982)
The table lists the total 1981 premium income of pension funds of Dutch firms, for 18 professional branches. The other column gives the respective premium reserves.
18 subjects, 3 variables:
  1. Index
  2. Premium Income (in millions of guilders)
  3. Premium Reserves (in millions of guilders)

.page 76, table 1

Stackloss Data, from Brownlee (1965)
The data describe the operation of a plant for the oxidation of ammonia to nitric acid.
21 subjects, 5 variables:
  1. Index (i)
  2. Rate (x[1])
  3. Temperature (x[2])
  4. Acid Concentration (x[3])
  5. Stackloss (y)

.page 79, table 2

Coleman Data Set, Containing Information on 20 Schools from the Mid-Atlantic and New England States, from Mosteller and Tukey (1977)
20 subjects, 7 variables:
  1. Index
  2. staff salaries per pupil (x[1])
  3. percent of white-collar fathers (x[2])
  4. socioeconomic status composite deviation: means for family size, family intactness, father's education, mother's education, and home items (x[3])
  5. mean teacher's verbal test score (x[4])
  6. mean mother's educational level (x[5]), one unit is equal to two school years
  7. verbal mean test score (y, all sixth graders)

.page 82, table 5

Salinity Data, from Ruppert and Carroll (1980)
This is a set of measurements of water salinity (i.e., its salt concentration) and river discharge taken in North Carolina's Pamlico Sound.
28 subjects, 5 variables:
  1. Index (i)
  2. Lagged Salinity (x[1])
  3. Trend (x[2])
  4. Discharge (x[3])
  5. Salinity (y)

.page 86, table 6

Air Quality Data Set for May 1973, from Chambers et al. (1983)
31 subjects, 5 variables:
  1. Index (i)
  2. Solar Radiation (x[1])
  3. Windspeed (x[2], in miles per hour)
  4. Temperature (x[3], in degrees Fahrenheit)
  5. Ozone (y, in parts per billion)

.page 94, table 9

Artificial Data Set generated by Hawkins, Bradu, and Kass (1984)
75 subjects, 5 variables:
  1. Index
  2. x[1]
  3. x[2]
  4. x[3]
  5. y

.page 96, table 10

Cloud Point of a Liquid, from Draper and Smith (1969)
The cloud point is a measure of the degree of crystallization in a stock.
19 subjects, 3 variables:
  1. Index (i)
  2. Percentage of I-8 (x)
  3. Cloud point (y)

.page 103, table 13

Heart Catheterization Data, from Weisberg (1980)
A catheter is passed into a major vein or artery at the femoral region and moved into the heart. The proper length of the introduced catheter has to be guessed by the physician. The aim of the data is to describe the relation between the catheter length and the patient's height.
12 subjects, 4 variables:
  1. Index (i)
  2. Height (x[1], in inches)
  3. Weight (x[2], in pounds)
  4. Catheter Length (y, in centimeters)

.page 110, table 16

Education Expenditure Data, from Chatterjee and Price (1977)
50 subjects, 7 variables:
  1. Index
  2. State
  3. Region (1=Northeastern, 2=North central, 3=Southern, 4=Western)
  4. Number of residents per thousand residing in urban areas in 1970 (x[1])
  5. Per capita personal income in 1973 (x[2])
  6. Number of residents per thousand under 18 years of age in 1974 (x[3])
  7. Per capita expenditure on public education in a state, projected for 1975 (y)

.page 154, table 22

Aircraft Data, from the Office of Naval Research
The data deal with 23 single-engine aircraft built over the years 1947-1979.
23 subjects, 6 variables:
  1. Index
  2. Aspect Ratio
  3. Lift-to-Drag Ratio
  4. Weight
  5. Thrust
  6. Cost

.page 155, table 23

Delivery Time Data, from Montgomery and Peck (1982)
The aim is to explain the time required to service a vending machine by means of the number of products stocked and the distance walked by the route driver.
25 subjects, 4 variables:
  1. Index (i)
  2. Number of Products (x[1])
  3. Distance (x[2])
  4. Delivery time (y)

.page 156, table 24

Phosphorus Content Data, from Prescott (1975)
The data investigate the effect of inorganic and organic phosphorus in the soil upon the phosphorus content of the corn grown in this soil.
18 subjects, 4 variables:
  1. Index (i)
  2. Inorganic Phosphorus (x[1])
  3. Organic Phosphorus (x[2])
  4. Plant Phosphorus (y)

