Descriptive Statistics

Sheldon M. Ross , in Introduction to Probability and Statistics for Engineers and Scientists (Fifth Edition), 2014

Definition

The sample 25 percentile is called the first quartile; the sample 50 percentile is called the sample median or the second quartile; the sample 75 percentile is called the third quartile.

The quartiles break up a data set into four parts, with roughly 25 percent of the data being less than the first quartile, 25 percent being between the first and second quartile, 25 percent being between the second and third quartile, and 25 percent being greater than the third quartile.

Example 2.3i

Noise is measured in decibels, denoted as dB. One decibel is about the level of the weakest sound that can be heard in quiet surroundings by someone with good hearing; a whisper measures about 30 dB; a human voice in normal conversation is about 70 dB; a loud radio is about 100 dB. Ear discomfort usually occurs at a noise level of about 120 dB.

The following data give noise levels measured at 36 different times directly outside of Grand Central Station in Manhattan.

82, 89, 94, 110, 74, 122, 112, 95, 100, 78, 65, 60, 90, 83, 87, 75, 114, 85, 69, 94, 124, 115, 107, 88, 97, 74, 72, 68, 83, 91, 90, 102, 77, 125, 108, 65

Determine the quartiles.
Solution

A stem and leaf plot of the data is as follows:

6 0,5,5,8,9
7 2,4,4,5,7,8
8 2,3,3,5,7,8,9
9 0,0,1,4,4,5,7
10 0,2,7,8
11 0,2,4,5
12 2,4,5
Because 36/4 = 9, the first quartile is 74.5, the average of the 9th and 10th smallest data values; the second quartile is 89.5, the average of the 18th and 19th smallest values; the third quartile is 104.5, the average of the 27th and 28th smallest values. ■

A box plot is often used to plot some of the summarizing statistics of a data set. A straight line segment stretching from the smallest to the largest data value is drawn on a horizontal axis; imposed on the line is a "box," which starts at the first and continues to the third quartile, with the value of the second quartile indicated by a vertical line. For instance, the 42 data values presented in Table 2.1 go from a low value of 57 to a high value of 70. The value of the first quartile (equal to the value of the 11th smallest on the list) is 60; the value of the second quartile (equal to the average of the 21st and 22nd smallest values) is 61.5; and the value of the third quartile (equal to the value of the 32nd smallest on the list) is 64. The box plot for this data set is shown in Figure 2.7.

Figure 2.7. A box plot.

The length of the line segment on the box plot, equal to the largest minus the smallest data value, is called the range of the data. Also, the length of the box itself, equal to the third quartile minus the first quartile, is called the interquartile range.
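As a rough illustration (not part of the text), the following Python sketch computes the quartiles, range, and interquartile range for a small hypothetical data set, using the convention of Example 2.3i of averaging two adjacent order statistics when n/4 is an integer.

# Minimal sketch: quartiles, range, and interquartile range for a hypothetical
# sample whose size is a multiple of 4, so each quartile is the average of two
# adjacent order statistics, as in Example 2.3i above.
data = sorted([57, 58, 60, 60, 61, 62, 63, 64, 64, 66, 68, 70])   # hypothetical values
n = len(data)

def avg_order_stats(xs, k):
    # Average the k-th and (k+1)-th smallest values (k is 1-based).
    return (xs[k - 1] + xs[k]) / 2

first_quartile = avg_order_stats(data, n // 4)
second_quartile = avg_order_stats(data, n // 2)      # the sample median
third_quartile = avg_order_stats(data, 3 * n // 4)

data_range = data[-1] - data[0]                      # largest minus smallest
iqr = third_quartile - first_quartile                # interquartile range
print(first_quartile, second_quartile, third_quartile, data_range, iqr)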

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123948113500022

Descriptive Methods

Ronald N. Forthofer , ... Mike Hernandez , in Biostatistics (2nd Edition), 2007

3.4.1 Range and Percentiles

The range is defined as the maximum value minus the minimum value. It is simple to calculate, and it provides some idea of the spread of the data. For the patients under 60 years of age in Table 3.10, the range is the difference between 152 and 100, which is 52. In the second data set pertaining to patients 60 and over, the range is the difference between 170 and 100, which is 70.

Table 3.10. Systolic blood pressure (mmHg) of patients under 60 years and 60 years and over, DIG40.

Under 60 Years: 100 115 116 120 120 124 130 130 130 130 134 140 140 140 140 150 152
60 Years and Over: 100 104 105 110 116 116 120 122 128 128 130 130 138 140 140 140 144 144 150 150 150 170 170

This difference in the two ranges points to a contrast between the two data sets. Although the range can be informative, it has two major deficiencies: (1) it ignores most of the data, since only two observations are used in its definition, and (2) its value depends indirectly on sample size. The range will either remain the same or increase as more observations are added to a data set; it cannot decrease. A better measure of variability would use more of the information in the data by using more of the data points in its definition and would not be so dependent on sample size.

Percentiles, deciles, and quartiles are locations of an ordered data set that divide the data into parts. Quartiles divide the data into four equal parts. The first quartile (q1), or 25th percentile, is located such that 25 percent of the data lie below q1 and 75 percent of the data lie above q1. The second quartile (q2), or 50th percentile or median, is located such that half (50 percent) of the data lie below q2 and the other half (50 percent) of the data lie above q2. The third quartile (q3), or 75th percentile, is located such that 75 percent of the data lie below q3 and 25 percent of the data lie above q3. The interquartile range, the difference of the 75th and 25th percentiles (the third and first quartiles), uses more information from the data than does the range. In addition, the interquartile (or semiquartile) range can either increase or decrease as the sample size increases. The interquartile range is a measure of the spread of the middle 50 percent of the values. To find the value of the interquartile range requires that the first and third quartiles be specified, and there are several reasonable ways of calculating them. We shall use the following procedure to calculate the 25th percentile for a sample of size n:

1.

If (n + 1)/4 is an integer, then the 25th percentile is the value of the [(n + 1)/4]th smallest observation.

2.

If (n + 1)/4 is not an integer, then the 25th percentile is a value between two observations. For instance, if n is 22, then (n + 1)/4 is (22 + 1)/4 = 5.75. The 25th percentile then is a value three-fourths of the way between the 5th and 6th smallest observations. To find it, we sum the 5th smallest observation and 0.75 of the difference between the 6th and 5th smallest observations.

The sample size is 40 for the systolic blood pressure data in Table 3.11. According to our procedure, we begin by sorting the data in ascending order. Next, we calculate (40 + 1)/4, which is 10.25. Hence the 25th percentile is a value one-fourth of the way between the 10th and 11th smallest observations. Since the 10th and 11th smallest observations have the same value of 120, the 25th percentile, or the first quartile, is 120 mmHg. The 75th percentile is found in the same way except that we use 3(n + 1)/4 in place of (n + 1)/4. Since 3(40 + 1)/4 yields 30.75, the 75th percentile is a value three-fourths of the way between the 30th and 31st observations. Since the 30th and 31st observations have the same value of 140, the 75th percentile, or the third quartile, is 140 mmHg. Hence, the interquartile range is 140 − 120 = 20. Calculating the interquartile range for systolic blood pressure readings of patients under 60 years of age and 60 years and over gives the values 20 and 28, respectively. The larger interquartile range for the 60 and over age group suggests that there is more variability in the data compared to the systolic blood pressure readings for the younger age group.
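As a hedged illustration (not the authors' code), the percentile rule just described — place the 100p percentile at rank p(n + 1) and interpolate between neighboring order statistics when that rank is fractional — can be sketched in Python as follows; the sample values are hypothetical.

# Sketch of the percentile procedure described above. The p-th quantile sits at
# rank p*(n + 1); a fractional rank means interpolating between the two
# neighboring order statistics by the fractional part.
def percentile(values, p):
    xs = sorted(values)
    n = len(xs)
    rank = p * (n + 1)                  # e.g., 0.25 * (22 + 1) = 5.75
    k = int(rank)                       # lower neighboring rank
    frac = rank - k                     # distance toward the next order statistic
    if frac == 0:
        return xs[k - 1]                # integer rank: take that observation
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])

# Hypothetical sample of size 22, mirroring the n = 22 case in step 2:
sample = [101, 103, 104, 106, 108, 112, 115, 116, 118, 120, 121,
          123, 125, 127, 128, 130, 132, 135, 138, 140, 142, 145]
q1 = percentile(sample, 0.25)           # three-fourths of the way from the 5th to the 6th value
q3 = percentile(sample, 0.75)
print(q1, q3, q3 - q1)                  # the last value is the interquartile range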

Table 3.11. Systolic blood pressure of patients who have had a previous myocardial infarction, stratified by the dose level of Digoxin treatment assigned, DIG200.

Low Dose Digoxin Treatment (0.125 mg/dL) High Dose Digoxin Treatment (0.375 mg/dL)
140 102 85 160 150 96 118 120 124 140
144 130 130 110 110 120 122 130 140 150

The values of five selected percentiles — the 10th, 25th, 50th, 75th, and 90th — when considered together provide good descriptions of the central tendency and the spread of the data. However, when the sample size is very small, the calculation of the extreme percentiles is problematic. For example, when n is 5, it is difficult to determine how the 10th percentile should be calculated. Because of this difficulty, and also because of the instability of the extreme percentiles for small samples, we shall calculate them only when the sample size is reasonably large — say, larger than 30. The next measure of variability to be discussed is the variance, but, before considering it, we discuss the box plot because of its relation to the five percentiles.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B978012369492850008X

Descriptive Statistics I: Univariate Statistics

Andrew P. King , Robert J. Eckersley , in Statistics for Biomedical Engineers and Scientists, 2019

1.8 Using MATLAB for Univariate Descriptive Statistics

Most of the functions required to perform univariate descriptive statistical analysis are core functions within MATLAB. You can use the MATLAB doc or help commands to get more information on all of these functions. Where there is no built-in MATLAB function, we provide implementations of our own in the sections below.

Many of the code examples included below make use of the student height data introduced in this chapter. These data are available through the book's web site as "fortyStudentHeightData.mat" and "fortyStudentGenderData.mat". The data are stored as column arrays.

1.8.1 Visualization of Univariate Data

Dotplot:

There is no built-in MATLAB command to produce a dotplot. The function that we provide below, which is available to download as "dotplot.m" from the book's web site, was used to produce the examples in this chapter.

Histogram:

The histogram command produces a histogram from the data in the array x. The parameter nbins can be used to specify the number of bins to use in the histogram. If nbins is not specified, then MATLAB will automatically determine the number of bins.

Bar Chart:

Bar charts can be produced using the MATLAB bar command. The bar chart of student height data shown in Fig. 1.5 was produced using the following code example, which starts from an array of categorical (M or F) values called gender:

Note how the MATLAB categorical type is used to set the labels for the bars.

1.8.2 Calculating Measures of Central Tendency

Mean, Median and Mode:

There are built-in commands for each measure of central tendency in MATLAB, as the following code illustrates.

1.8.3 Calculating Measures of Variation

Standard Deviation and IQR:

There are built-in commands for each measure of variation in MATLAB, as illustrated in the code below:

Upper and Lower Quartiles:

Values for the upper and lower quartiles can be calculated using the quantile function. Note that the last parameter denotes which fraction of the data should be below the returned value.

Skew:

Using the same data as before, the skew, as defined in Section 1.5.3, can be calculated in MATLAB as follows:

1.8.4 Visualizing Measures of Variation

Error Bars:

The MATLAB errorbar function can be used to produce plots with error bars. The following MATLAB commands were used to produce the plot shown in Fig. 1.8:

Box Plot:

The MATLAB boxplot function produces box plot visualizations. Its first argument is the continuous or discrete data array whose values are to be used for the y-axis of the plot. The second argument is a "grouping" variable whose values will be used to split up the data of the first argument. The following MATLAB code was used to produce the box plots shown in Fig. 1.9.

Note that the default behavior in MATLAB box plots is to consider a data value to be an outlier if it is larger than the upper quartile + 1.5 × IQR or less than the lower quartile − 1.5 × IQR. This is a reasonable way to highlight potential outliers, but it is not a rigorous method to detect them.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780081029398000104

Integrated Population Biology and Modeling, Part A

David A. Swanson , Lucky M. Tedrow , in Handbook of Statistics, 2018

3 Results

3.1 Changes in Life Expectancy and Differences Among the SES Quartiles

Table 1 provides the summary life expectancy results by SES quartile and year. In this table we can see that the mean life expectancy at birth (e 0) in 1970 was 50.52 for the lowest SES (first) quartile, increased to 57.33 in the second quartile, increased again to 62.20 in the third SES quartile, and reached almost 69.51 in the highest SES (fourth) quartile. Thus, there was a 19-year difference in e 0 between the lowest and highest SES quartile. The coefficient of variation (CV) was highest in the two lowest SES quartiles, 0.12 and 0.13, respectively, and lowest in the two highest SES quartiles, 0.10 and 0.07, respectively. These values are consistent with findings elsewhere (e.g., Swanson and Tedrow, 2016) that variation in age at death declines as e 0 increases.

Table 1. Mean Life Expectancy at Birth (e 0) by PCGDP Quartile, 1970, 1990, and 2010

2010 Quartile 1 e 0 1990 Quartile 1 e 0 1970 Quartile 1 e 0
MEAN 59.56 MEAN 55.37 MEAN 50.52
MEDIAN 59.24 MEDIAN 55.99 MEDIAN 50.57
STDEV 7.18 STDEV 5.64 STDEV 5.92
CV 0.12 CV 0.10 CV 0.12
2010 Quartile 2 e 0 1990 Quartile 2 e 0 1970 Quartile 2 e 0
MEAN 68.63 MEAN 64.36 MEAN 57.33
MEDIAN 70.33 MEDIAN 65.05 MEDIAN 57.19
STDEV 6.31 STDEV 6.85 STDEV 7.38
CV 0.09 CV 0.11 CV 0.13
2010 Quartile 3 e 0 1990 Quartile 3 e 0 1970 Quartile 3 e 0
MEAN 73.36 MEAN 69.59 MEAN 62.20
MEDIAN 74.37 MEDIAN 70.76 MEDIAN 63.57
STDEV 5.12 STDEV 4.25 STDEV 6.19
CV 0.07 CV 0.06 CV 0.10
2010 Quartile 4 e 0 1990 Quartile 4 e 0 1970 Quartile 4 e 0
MEAN 78.97 MEAN 75.22 MEAN 69.51
MEDIAN 80.05 MEDIAN 75.65 MEDIAN 70.87
STDEV 4.22 STDEV 2.83 STDEV 4.71
CV 0.05 CV 0.04 CV 0.07

Table 1 also shows that between 1970 and 1990 mean e 0 increased by almost 5 years in the lowest SES quartile, from 50.52 to 55.37, and it increased again between 1990 and 2010, from 55.37 to 59.56 years, an increase of just over 4 years. For the second quartile, mean e 0 increased over 7 years between 1970 and 1990, from 57.33 to 64.36, and increased just over 4 years between 1990 and 2010, from 64.36 to 68.63. The third quartile shows an increase in mean e 0 of more than 7 years between 1970 and 1990, going from 62.20 to 69.59, and an increase of just under 4 years between 1990 and 2010, from 69.59 to 73.36. In the highest quartile, mean e 0 increased by nearly 6 years between 1970 and 1990, from 69.51 to 75.22, while between 1990 and 2010, it increased by just under 4 years, from 75.22 to 78.97. In summary, it is clear that increases in mean e 0 occurred for each SES quartile over the entire 40-year period, 1970–2010, and in each of the two 20-year periods, 1970–90 and 1990–2010.

Tables 2 and 3 provide another perspective on changes in e 0 by SES over time in that they show the absolute and relative changes in e 0 between 1970 and 1990 and between 1990 and 2010, respectively. In comparing these two tables, we see that both absolute and relative changes in mean e 0 found in the 1970–90 period declined for all four SES groups between 1990 and 2010. Thus, while mean e 0 increased in all four SES quartiles between 1970 and 1990 as well as between 1990 and 2010: (1) the level of increase found in the 1970–90 period declined for all four groups in the 1990–2010 period; (2) the differences between the lowest quartile and the higher ones remained the same or increased slightly between 1970 and 2010; and (3) differences in mean e 0 among the higher three quartiles remained the same or declined slightly over the same 40-year period.

Table 2. Change in Mean e 0 by Quartile, 1970–1990

Change in Mean e 0 by Quartile, 1970–1990
Quartile Absolute Percent
1 (lowest PCGDP) 4.85 9.60
2 7.03 12.26
3 7.39 11.88
4 (highest PCGDP) 5.72 8.23
N = 159

Table 3. Change in Mean e 0 by Quartile, 1990–2010

Change in Mean e 0 by Quartile, 1990–2010
Quartile Absolute Percent
1 (lowest PCGDP) 4.19 7.57
2 4.27 6.64
3 3.77 5.41
4 (highest PCGDP) 3.75 4.98
N = 159

3.2 Changes in SES

Table 4 shows summary statistics for the changes in SES (PCGDP) by quartile and year, while Tables 5 and 6 show the changes that occurred between 1970 and 1990 and between 1990 and 2010. In Table 5, we see that mean PCGDP between 1970 and 1990 increased by 10.63% ($39.22 in constant US 2005 dollars) for the lowest quartile and by 54.32% ($9,306 in constant US 2005 dollars) for the highest quartile. Table 6 shows that between 1990 and 2010, the mean PCGDP in the lowest quartile increased by 17.66% ($72.07 in constant US 2005 dollars) and by 23.85% ($6,235.46 in constant US 2005 dollars) in the highest quartile. Comparing Tables 5 and 6 reveals that populations in the second and third quartiles experienced higher absolute and relative changes in mean PCGDP between 1990 and 2010 than found between 1970 and 1990 (Figs. 5–7).
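As a hedged illustration of how summary measures of this kind (mean, median, standard deviation, and CV by quartile, as in Tables 1 and 4) can be computed, the following pandas sketch uses made-up values and hypothetical column names; it is not the authors' code.

# Hypothetical sketch: records with a life expectancy column ("e0") and an SES
# quartile label ("ses_quartile"); summarize each quartile by mean, median,
# standard deviation, and coefficient of variation (CV = STDEV / MEAN).
import pandas as pd

df = pd.DataFrame({
    "ses_quartile": [1, 1, 2, 2, 3, 3, 4, 4],                  # made-up rows
    "e0":           [49.0, 52.0, 56.5, 58.2, 61.8, 62.6, 68.9, 70.1],
})

summary = df.groupby("ses_quartile")["e0"].agg(["mean", "median", "std"])
summary["cv"] = summary["std"] / summary["mean"]
print(summary.round(2))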

Table 4. Summary Measures of PCGDP by Quartile and Year

2010 Quartile 1 PCGDP 1990 Quartile 1 PCGDP 1970 Quartile 1 PCGDP
MEAN $545.75 MEAN $402.45 MEAN $369.92
MEDIAN $480.20 MEDIAN $408.13 MEDIAN $368.90
STDEV $227.74 STDEV $156.15 STDEV $148.56
CV 0.42 CV 0.39 CV 0.40
2010 Quartile 2 PCGDP 1990 Quartile 2 PCGDP 1970 Quartile 2 PCGDP
MEAN $1985.00 MEAN $1281.15 MEAN $985.63
MEDIAN $1791.81 MEDIAN $1124.38 MEDIAN $980.48
STDEV $732.83 STDEV $409.74 STDEV $269.82
CV 0.37 CV 0.32 CV 0.27
2010 Quartile 3 PCGDP 1990 Quartile 3 PCGDP 1970 Quartile 3 PCGDP
MEAN $6222.87 MEAN $3923.85 MEAN $2705.56
MEDIAN $6019.89 MEDIAN $3696.97 MEDIAN $2450.51
STDEV $7102.65 STDEV $4662.93 STDEV $6055.21
CV 1.14 CV 1.19 CV 2.24
2010 Quartile 4 PCGDP 1990 Quartile 4 PCGDP 1970 Quartile 4 PCGDP
MEAN $34,079.96 MEAN $25,294.50 MEAN $21,088.95
MEDIAN $32,674.26 MEDIAN $26,438.80 MEDIAN $17,132.54
STDEV $16,626.86 STDEV $12,006.03 STDEV $18,086.48
CV 0.49 CV 0.47 CV 0.86

Table 5. Change in Mean PCGDP by Quartile, 1970–90

Change in Mean PCGDP by Quartile, 1970–90
Quartile Absolute Percent
1 (lowest PCGDP) $39.22 10.63
2 $143.90 14.68
3 $1246.46 50.87
4 (highest PCGDP) $9306.26 54.32
N = 159

Table 6. Change in Mean PCGDP by Quartile, 1990–2010

Change in Mean PCGDP by Quartile, 1990–2010
Quartile Absolute Percent
1 (lowest PCGDP) $72.07 17.66
2 $667.43 59.36
3 $2322.92 62.83
4 (highest PCGDP) $6235.46 23.58
N = 159

Fig. 5

Fig. 5. Ln(e 0) by Ln(PCT80+), 1970.

Fig. 6

Fig. 6. Ln(e 0) by Ln(PCT80+), 1990.

Fig. 7

Fig. 7. Ln(e 0) by Ln(PCT80   +), 2010.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/S0169716118300014

Financial, Macro and Micro Econometrics Using R

Roberto S. Mariano , Suleyman Ozmucur , in Handbook of Statistics, 2020

3.2.1 Descriptive statistics for GDP and GDP deflator

The stats package provides the minimum, maximum, first and third quartiles, mean, and median (R Core Team, 2018). Real GDP growth from the first quarter of 1998 to the first quarter of 1999 was only 0.26%, while GDP deflator growth was 8.6%. On the other hand, real GDP growth from the second quarter of 2017 to the second quarter of 2018 was 6.0%, while GDP deflator growth was 3.4%. The mean for all 78 observations was 5.2% for real GDP, and 3.9% for the GDP deflator.

> summary(HOS41Q_1521dbn)
      date                          NOM               GDP               DEF
 Min.   :1999-01-01 00:00:00   Min.   : 1.444   Min.   :0.2596   Min.   :-1.456
 1st Qu.:2003-10-24 00:00:00   1st Qu.: 8.516   1st Qu.:3.9845   1st Qu.: 2.456
 Median :2008-08-16 00:00:00   Median : 9.338   Median :5.6681   Median : 3.907
 Mean   :2008-08-15 21:13:50   Mean   : 9.261   Mean   :5.2082   Mean   : 3.866
 3rd Qu.:2013-06-08 06:00:00   3rd Qu.:10.320   3rd Qu.:6.6317   3rd Qu.: 5.235
 Max.   :2018-04-01 00:00:00   Max.   :15.403   Max.   :8.9131   Max.   : 9.665
      Time
 Min.   :1999
 1st Qu.:2004
 Median :2009
 Mean   :2009
 3rd Qu.:2013
 Max.   :2018

Kernel density estimates show that real GDP growth is skewed to the left (with a skewness coefficient of −0.59), and slightly less peaked than a normal distribution (with an excess kurtosis of −0.15). It has a mean of 5.2 with a standard deviation of 1.86, a median of 5.67, and a mode close to 7. On the other hand, the GDP deflator has a distribution relatively close to normal (with a skewness coefficient of 0.15 and an excess kurtosis of 0.19). It has a mean of 3.87, with a standard deviation of 2.1.

The psych package's (Revelle, 2019) describe command also provides the standard deviation (same as sd in the stats package), mean absolute deviation, trimmed mean, skewness, and kurtosis. The moments package has tests for skewness and kurtosis.

>   yts3   <- yts[,2:4]

> round(describe(yts3),2)
    vars  n mean   sd median trimmed  mad   min   max range  skew kurtosis   se
NOM    1 78 9.26 2.30   9.34    9.31 1.38  1.44 15.40 13.96 -0.47     1.70 0.26
GDP    2 78 5.21 1.86   5.67    5.33 1.77  0.26  8.91  8.65 -0.59    -0.15 0.21
DEF    3 78 3.87 2.10   3.91    3.87 2.10 -1.46  9.66 11.12  0.05     0.19 0.24

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/S0169716119300343

Landmark Summaries

Andrew F. Siegel , in Practical Business Statistics (Seventh Edition), 2016

Extremes, Quartiles, and Box Plots

One important use of percentiles is as landmark summary values. You can use a few percentiles to summarize important features of the entire distribution. You have already seen the median, which is the 50th percentile since it is ranked halfway between the smallest and largest. The extremes, the smallest and largest values, are often interesting. These are the 0th and 100th percentiles, respectively. To complete a small set of landmark summaries, we also use the quartiles, defined as the 25th and 75th percentiles.

It may come as a surprise to learn that statisticians cannot agree on exactly what a quartile is and that there are many different ways to compute a quartile. The idea is clear: Quartiles are the data values ranked one-fourth of the way in from the smallest and largest values; however, there is ambiguity as to exactly how they should be computed. John Tukey, who created exploratory data analysis, defines quartiles as follows 13 :

1.

Find the median rank, (1 + n)/2, and discard any fraction. For example, with n = 13, use (1 + 13)/2 = 7. However, with n = 24, you would drop the decimal part of (1 + 24)/2 = 12.5 and use 12.

2.

Add 1 to this and divide by 2. This gives the rank of the lower quartile. For example, with n = 13, you find (1 + 7)/2 = 4. With n = 24, you find (1 + 12)/2 = 6.5, which tells you to average the data values with ranks 6 and 7.

3.

Subtract this rank from (n + 1). This gives the rank of the upper quartile. For example, with n = 13, you have (13 + 1) − 4 = 10. With n = 24, you have (1 + 24) − 6.5 = 18.5, which tells you to average the data values with ranks 18 and 19.

The quartiles themselves may then be found based on these ranks. A general formula for the ranks of the quartiles, expressing the steps just given, may be written as follows and shows how the computer finds these numbers:

Ranks for the Quartiles

Rank of lower quartile = [1 + int((1 + n)/2)] / 2

Rank of upper quartile = n + 1 − Rank of lower quartile

where int refers to the integer part function, which discards any decimal portion.
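As a hedged illustration (not the book's code), Tukey's rule can be sketched in Python as follows; a rank ending in .5 means that the two data values with the neighboring ranks are averaged, and the data set at the end is hypothetical.

# Sketch of Tukey's quartile ranks as defined above (ranks are 1-based).
def tukey_quartile_ranks(n):
    median_rank = int((1 + n) / 2)          # discard any fraction
    lower_rank = (1 + median_rank) / 2      # rank of the lower quartile
    upper_rank = n + 1 - lower_rank         # rank of the upper quartile
    return lower_rank, upper_rank

def value_at_rank(sorted_values, rank):
    if rank == int(rank):                   # whole rank: take that data value
        return sorted_values[int(rank) - 1]
    lo = int(rank)                          # rank ends in .5: average the neighbors
    return (sorted_values[lo - 1] + sorted_values[lo]) / 2

print(tukey_quartile_ranks(13))             # (4.0, 10.0), as in the n = 13 example
print(tukey_quartile_ranks(24))             # (6.5, 18.5), as in the n = 24 example

values = sorted([12, 3, 7, 1, 9, 4, 8, 2, 13, 6, 5, 11, 10])    # hypothetical, n = 13
lower_rank, upper_rank = tukey_quartile_ranks(len(values))
print(value_at_rank(values, lower_rank), value_at_rank(values, upper_rank))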

The five-number summary is defined as the following set of five landmark summaries: smallest, lower quartile, median, upper quartile, and largest.

The Five-Number Summary

The smallest data value (the 0th percentile).

The lower quartile (the 25th percentile, one-fourth of the way in from the smallest).

The median (the 50th percentile, in the middle).

The upper quartile (the 75th percentile, three-fourths of the way in from the smallest and one-fourth of the way in from the largest).

The largest data value (the 100th percentile).

These five numbers taken together give a clear look at many features of the unprocessed data. The two extremes indicate the range spanned by the data, the median indicates the center, the two quartiles indicate the edges of the "middle half of the data," and the position of the median between the quartiles gives a crude indication of skewness or symmetry.

The box plot is a picture of the five-number summary, as shown in Fig. 4.2.1. The box plot serves the same purpose as a histogram—namely, to provide a visual impression of the distribution—but it does this in a different way. Box plots show less detail and are therefore more useful for seeing the big picture and comparing several groups of numbers without the distraction of every detail of each group. The histogram is still preferable for a more detailed look at the shape of the distribution.

Fig. 4.2.1. A box plot displays the five-number summary for a univariate data set, giving a quick impression of the distribution.

The detailed box plot is a box plot, modified to display the outliers, which are identified by labels (which are also used for the most extreme observations that are not outliers). These labels can be very useful in calling attention to cases that may deserve special attention. For the purpose of creating a detailed box plot, outliers are defined as those data points (if any) that are far from the center of the data set. Specifically, a large data value will be declared to be an outlier if it is bigger than

Upper quartile + 1.5 × (Upper quartile − Lower quartile)

A small data value will be declared to be an outlier if it is smaller than

Lower quartile − 1.5 × (Upper quartile − Lower quartile)

This definition of outliers is due to Tukey. 14 In addition to displaying and labeling outliers, you may also label the most extreme cases that are not outliers (one on each side) since these are often worth special attention. See Fig. 4.2.2 for a comparison of a box plot and a detailed box plot.

Fig. 4.2.2. Box plot (bottom) and detailed box plot (top) for CEO compensation in the technology industry. Both plots show the five-number summary, but the detailed plot provides further important information about outliers (and the largest and smallest values that are not outliers) by identifying the companies. In this case, outliers represent firms with exceptionally high CEO compensation.

Example

Executive Compensation

How much money do chief executive officers make? Table 4.2.1 shows the compensation (salary and bonus) for the year 2000 received by CEOs of major technology companies. The data have been sorted and ranked, with the five-number summary values indicated in the table. There are n = 23 firms listed; hence, the median ($1,723,600) has rank (1 + 23)/2 = 12, which is the rank of Irwin Jacobs, then CEO of Qualcomm. The lower quartile ($1,211,650) has rank (1 + 12)/2 = 6.5 and is the average of CEO compensation at Lucent Technologies (rank 6) and Cisco Systems (rank 7). The upper quartile ($2,792,350) has rank 23 + 1 − 6.5 = 17.5 and is the average of compensation for Micron Technology (rank 17) and EMC (rank 18). The five-number summary of CEO compensation for these 23 technology companies is therefore

Smallest $0
Lower quartile 1,211,650
Median 1,723,600
Upper quartile 2,792,350
Largest 10,000,000

Table 4.2.1. CEO Compensation in Technology

Company Executive Salary and Bonus Rank Five-Number Summary
IBM a Louis V. Gerstner Jr. $10,000,000 23 Largest is $10,000,000
Advanced Micro Devices a W. J. Sanders III 7,328,600 22
Sun Microsystems Scott G. McNealy 4,871,300 21
Compaq Computer Michael D. Capellas 3,891,000 20
Applied Materials James C. Morgan 3,835,800 19
EMC Michael C. Ruettgers 2,809,900 18
Upper quartile is $2,792,350
Micron Technology Steven R. Appleton 2,774,800 17
Hewlett-Packard Carleton S. Fiorina 2,766,300 16
Motorola Christopher B. Galvin 2,525,000 15
National Semiconductor Brian L. Halla 2,369,800 14
Texas Instruments Thomas J. Engibous 2,096,200 13
Qualcomm Irwin Mark Jacobs 1,723,600 12 Median is $1,723,600
Unisys Lawrence A. Weinbach 1,716,000 11
Pitney Bowes Michael J. Critelli 1,519,000 10
NCR Lars Nyberg 1,452,100 9
Harris Phillip W. Farmer 1,450,000 8
Cisco Systems John T. Chambers 1,323,300 7
Lower quartile is $1,211,650
Lucent Technologies Richard A. McGinn 1,100,000 6
Silicon Graphics Robert R. Bishop 692,300 5
Microsoft Steven A. Ballmer 628,400 4
Western Digital Matthew E. Massengill 580,500 3
Oracle Lawrence J. Ellison 208,000 2
Apple Computer Steven P. Jobs 0 1 Smallest is $0
a
These values are outliers.

Source: Data from Wall Street Journal, April 12, 2001, pp. R12–R15. Their source is William M. Mercer Inc., New York.

Are there any outliers? If we compute using the quartiles, at the high end, any compensation larger than 2,792,350 + 1.5 × (2,792,350 − 1,211,650) = $5,163,400 will be an outlier.

Thus, the two largest data values, IBM and Advanced Micro Devices (AMD), are outliers. At the low end, any compensation smaller than 1,211,650 − 1.5 × (2,792,350 − 1,211,650) = −1,159,400, a negative number, would be an outlier. Since the smallest compensation is $0 (for Steve Jobs at Apple Computer), there are no outliers at the low end of the distribution.
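A quick check of this arithmetic in Python (an illustration only):

# Outlier cutoffs for the CEO compensation example.
lower_quartile, upper_quartile = 1_211_650, 2_792_350
iqr = upper_quartile - lower_quartile            # 1,580,700
high_cutoff = upper_quartile + 1.5 * iqr         # 5,163,400
low_cutoff = lower_quartile - 1.5 * iqr          # -1,159,400
print(high_cutoff, low_cutoff)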

Box plots (in two styles) for these 23 technology companies are displayed in Fig. 4.2.2. The detailed box plot conveys more information by identifying the outlying firms (and the most extreme firms that are not outliers). Although ordinarily you would use only one of the two styles, we show both here for comparison.

One of the strengths of box plots is their ability to help you concentrate on the important overall features of several data sets at once, without being overwhelmed by the details. Consider the CEO compensation for the year 2000 for major companies in utilities, financial, and energy as well as technology. 15 This consists of four individual data sets: a univariate data set (a group of numbers) for each of these four industry groups. Thus, there is a five-number summary and a box plot for each group.

By placing box plots near each other and on the same scale in Fig. 4.2.3, we facilitate the comparison of typical CEO compensation from one industry group to another. Note, for the detailed box plots, how helpful it is to have exceptional CEO firms labeled, compared to the box plots that display only the five-number summaries. Although the highest-paid CEOs come from financial companies, this industry is similar to the others near the lower end (e.g., look at the lower quartiles). While risks come along with the big bucks (e.g., Enron, the outlier in the Utilities group, filed for bankruptcy protection in December 2001 and its CEO resigned in January 2002), wouldn't it be nice to be in a job category where the lower quartile pays over a million dollars a year? Another way to look at this situation is to recognize that statistical methods had highlighted Enron as an unusual case based on data available well before the difficulties of this company became famous.

Fig. 4.2.3. Box plots for CEO compensation in major firms in selected industry groups, arranged on the same scale so that you may easily compare one group to another. The top figure gives details about the outliers (and most extreme nonoutliers) while the bottom figure shows just the five-number summary.

Which kind of box plot is the best? Using computers, it is easy to display the outliers (if any) separately in the detailed box plot. However, for some purposes these additional details would be distracting, and the (ordinary) box plot would be preferred, especially if your focus is primarily on the middle of the distribution of the data. On the other hand, if the pattern of occurrence of outliers is of interest, then the detailed box plot would be best so that you can see where they are located.

Example

Data Mining the Donations Database

Consider the donations database of information on 20,000 people available at the companion site. In a data-mining example in Chapter 1, we found a greater percentage of donations among people who had given more often over the previous two years. But what about the size of the donations given? Do those who donated more frequently tend to give larger or smaller contributions than the others? Box plots can help us see what is going on in this large database.

We first focus attention on just the 989 donors out of the 20,000 people (eliminating, for now, the 19,011 who did not donate in response to the mailing). Next, using the ninth column of the database, we separate these 989 donors into four groups: 381 made just one previous gift over the past two years, 215 made two, 201 made three, and 192 made four or more. Taking the current donation amount (from the first column), we now have four univariate data sets.

One of the advantages of data mining is that when we break up the database into interesting parts like these, we have enough information in each part to work with. In this case, although the smallest of the four pieces is less than 1% of the database, it is still enough data (192 people) to see the distribution.

Box plots of the current donation amount for these four groups are shown in Fig. 4.2.4. Note the tendency for larger donations, typically, to come from those who have given fewer gifts! You can see this by noticing that the central box moves to the left (toward smaller donation amounts) as you go up in the figure toward those with more previous gifts. It seems that those who give more often tend to give less, not more, each time! This reminds us of the importance of those who donate less frequently.

Fig. 4.2.4. Box plots showing that those who donated more frequently (the number of previous gifts, increasing as you move upward) tend to give smaller current donation amounts (as you can see from the generally leftward shift in the box plots as you move upward from one box plot to the next).

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128042502000043

Introduction to Descriptive Statistics

Oliver C. Ibe , in Fundamentals of Applied Probability and Random Processes (Second Edition), 2014

8.5.7 Box and Whiskers Plot

The box and whisker diagram (or box plot) is a way to visually organize data into fourths or quartiles. The diagram is made up of a "box," which lies between the first and third quartiles, and "whiskers" that are straight lines extending from the ends of the box to the maximum and minimum data values. Thus, the middle two-fourths are enclosed in the "box" and the lower and upper fourths are drawn as whiskers. The procedure for drawing the diagram is as follows:

1.

Arrange the data in increasing order

2.

Find the median

3.

Find the first quartile, Q 1, which is the median of the lower half of the data set; and the third quartile, Q 3, which is the median of the upper half of the data set.

4.

On a line, mark points at the median, Q 1, Q 3, the minimum value of the data set, and the maximum value of the data set.

5.

Draw a box that lies between the first and third quartiles and thus represents the middle 50% of the data.

6.

Draw a line from the first quartile to the minimum data value, and another line from the third quartile to the maximum data value. These lines are the whiskers of the plot.

Thus the box plot identifies the middle 50% of the data, the median, and the extreme points. The plot is illustrated in Figure 8.10 for the rank-ordered data set that we have used in previous sections:

Figure 8.10. The Box and Whiskers Plot

12, 34, 48, 50, 50, 54, 56, 65, 66, 80, 88, 90

As discussed earlier, the median is M = Q 2 = 55, the first quartile is Q 1 = 49, and the third quartile is Q 3 = 73. The minimum data value is 12, and the maximum data value is 90. These values are all indicated in the figure.

When collecting data, sometimes a result is collected that seems "wrong" because it is much higher or much lower than all of the other values. Such points are known as "outliers." These outliers are commonly excluded from the whisker portion of the box and whiskers diagram. They are plotted individually and labeled as outliers.

As discussed earlier, the interquartile range, IQR, is the difference between the third quartile and the first quartile. That is, IQR = Q 3 − Q 1, which is the width of the box in the box and whiskers diagram. The IQR is one of the measures of dispersion, and statistics assumes that data values are clustered around some central value. The IQR can be used to tell when some of the other values are "too far" from the central value. In the box-and-whiskers diagram, an outlier is any data value that lies more than one and a half times the length of the box from either end of the box. That is, if a data point is below Q 1 − 1.5 × IQR or above Q 3 + 1.5 × IQR, it is viewed as being too far from the central values to be reasonable. Thus, the values Q 1 − 1.5 × IQR and Q 3 + 1.5 × IQR are the "fences" that mark off the "reasonable" values from the outlier values. That is, outliers are data values that lie outside the fences. Thus, we define

a.

Lower fence = Q 1 − 1.5 × IQR

b.

Upper fence = Q 3 + 1.5 × IQR

For the example in Figure 8.10, IQR = Q 3 − Q 1 = 24, so IQR × 1.5 = 36. From this we have that the lower fence is at Q 1 − 36 = 49 − 36 = 13, and the upper fence is at Q 3 + 36 = 73 + 36 = 109. The only data value that is outside the fences is 12; all other data values are within the two fences. Thus, 12 is the only outlier in the data set.

The following example illustrates how to draw the box and whiskers plot with outliers. Suppose that we are given a new data set, which is 10, 12, 8, 1, 10, 13, 24, 15, 15, 24. First, we rank-order the data in increasing order of magnitude to obtain:

1, 8, 10, 10, 12, 13, 15, 15, 24, 24

Since there are 10 entries, the median is the average of the 5th and 6th numbers. That is, M = (12 + 13)/2 = 12.5. The lower half of the data set is 1, 8, 10, 10, 12, whose median is Q 1 = 10. Similarly, the upper half of the data set is 13, 15, 15, 24, 24, whose median is Q 3 = 15. Thus, IQR = 15 − 10 = 5, the lower fence is at 10 − (1.5)(5) = 2.5, and the upper fence is at 15 + (1.5)(5) = 22.5. Because the data values 1, 24, and 24 are outside the fences, they are outliers. The two values of 24 are stacked on top of each other. This is illustrated in Figure 8.11. The outliers are explicitly indicated in the diagram.
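As a hedged illustration (not the book's code), the following Python sketch carries out the same steps — quartiles as medians of the lower and upper halves, fences at 1.5 × IQR, and outlier identification — and reproduces the numbers in this example.

from statistics import median

def box_whisker_summary(values):
    xs = sorted(values)                     # step 1: arrange in increasing order
    n = len(xs)
    m = median(xs)                          # step 2: the median
    lower_half = xs[: n // 2]               # step 3: quartiles as medians of the halves
    upper_half = xs[(n + 1) // 2:]          # (the middle value is excluded when n is odd)
    q1, q3 = median(lower_half), median(upper_half)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in xs if x < lower_fence or x > upper_fence]
    return m, q1, q3, lower_fence, upper_fence, outliers

data = [10, 12, 8, 1, 10, 13, 24, 15, 15, 24]
print(box_whisker_summary(data))            # (12.5, 10, 15, 2.5, 22.5, [1, 24, 24])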

Figure 8.11. Example of Box and Whiskers Plot with Outliers

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128008522000080

Getting Started with RapidMiner

Vijay Kotu , Bala Deshpande , in Data Science (Second Edition), 2019

15.3 Data Visualization Tools

Once a dataset is read into RapidMiner, the next step is to explore the dataset visually using a variety of tools. Before visualization is discussed, however, it is a good idea to check the metadata of the imported data to verify that all the right information is there. When the simple process described in Section 15.2 is run (be sure to connect the output of the read operator to the "result" connector of the process), an output is posted to the Results view of RapidMiner. The data table can be used to verify that indeed the data has been correctly imported under the Data tab on the left (see Fig. 15.10).

Figure 15.10. Results view that is shown when the data import process is successful.

By clicking on the Statistics tab (see Fig. 15.11), one can examine the type, missing values, and basic statistics for all the imported dataset attributes. The data type of each attribute (integer, real, or binomial) and some basic statistics can also be identified. This high-level overview is a good way to ensure that a dataset has been loaded correctly and that exploring the data in more detail using the visualization tools described later is possible.

Figure 15.11. Metadata is visible under the Statistics tab.

There are a variety of visualization tools available for univariate (one attribute), bivariate (two attributes), and multivariate analysis. Select the Charts tab in the Results view to access any of the visualization tools or plotters. General details about visualization are available in Chapter 3, Data Exploration.

Univariate Plots

1.

Histogram: A density estimation for numeric plots and a counter for categorical ones.

2.

Quartile (Box and Whisker): Shows the mean value, median, standard deviation, some percentiles, and any outliers for each attribute.

3.

Series (or Line): Commonly best used for time series data.

Bivariate Plots

All 2D and 3D charts show dependencies between tuples (pairs or triads) of variables. 3

1.

Scatter: The simplest of all 2D charts, which shows how one variable changes with respect to another. RapidMiner allows the use of color; the points can be colored to add a third dimension to the visualization.

2.

Scatter Multiple: Allows one axis to be fixed to one variable while cycling through the other attributes.

3.

Scatter Matrix: Allows all possible pairings between attributes to be examined. Color as usual adds a third dimension. Be careful with this plotter because as the number of attributes increases, rendering all the charts can slow down processing.

4.

Density: Similar to a 2D scatter chart, except the background may be filled in with a color gradient corresponding to one of the attributes.

5.

SOM: Stands for self-organizing map. It reduces the number of dimensions to two by applying transformations. Points that are similar along many attributes will be placed close together. It is basically a clustering visualization method. There are more details on clustering in Chapter 8, Model Evaluation. Note that SOM (and many of the parameterized reports) do not run automatically; if you switch to that report, there will be a blank screen until the inputs are set and, in the case of SOM, the calculate button is pushed.

Multivariate Plots

1.

Parallel: Uses one vertical axis for each attribute; thus, there are as many vertical axes as there are attributes. Each row is displayed as a line in the chart. Local normalization is useful to understand the variance in each variable. However, a deviation plot works better for this.

2.

Deviation: Same as parallel, but displays mean values and standard deviations.

3.

Scatter 3D: Quite similar to the 2D scatter chart but allows a three-dimensional visualization of three attributes (four, if the color of the points is included).

4.

Surface: A surface plot is a 3D version of an area plot, where the background is filled in.

These are not the only available plotters. Some additional ones are not described here, such as pie, bar, ring, block charts, etc. Generating any of the plots using the GUI is pretty much self-explanatory. The only words of caution are that when a large dataset is encountered, generating some of the graphics-intensive multivariate plots can be quite time consuming, depending on the available RAM and processor speed.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128147610000150

Basic Statistics

Chris Tsokos , Rebecca Wooten , in The Joy of Finite Mathematics, 2016

Critical Thinking

8.4.

The standard sample mean, the 25% trimmed mean, the minimum, the maximum, the quartiles, and the percentiles are all types of what kind of average? Which one do you believe is the best to estimate the true mean μ?

8.5.

Compute the standard arithmetic mean for the data given in the frequency table below.

Explain the meaning and usefulness of the results.

8.6.

Compute the standard sample variance and standard deviation for the data: 70, 79, 83, and 97.

Discuss their meaning and usefulness.

8.7.

Compute the sample variance and standard deviation for the data: 40, 49, 53, and 80.

Explain their usefulness.

8.8.

Compute the expected value for the data given in the relative frequency table below.

Explain what the expected value means with respect to the given data.

8.9.

Show that the standard sample mean is an unbiased estimator of the population mean; that is, given E(x) = μ and using summation notation, prove that E(x̄) = μ.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128029671000085

Data Exploration

Vijay Kotu , Bala Deshpande , in Data Science (Second Edition), 2019

Quartile

A box whisker plot is a simple visual way of showing the distribution of a continuous variable with data such as quartiles, median, and outliers, overlaid by the mean and standard deviation. The chief attraction of box whisker or quartile charts is that distributions of multiple attributes can be compared side by side and the overlap between them can be deduced. The quartiles are denoted by the Q1, Q2, and Q3 points, which indicate the data points with a 25% bin size. In a distribution, 25% of the data points will be below Q1, 50% will be below Q2, and 75% will be below Q3.

The Q1 and Q3 points in a box whisker plot are denoted by the edges of the box. The Q2 point, the median of the distribution, is indicated by a cross line within the box. The outliers are denoted by circles at the end of the whisker line. In some cases, the mean point is denoted by a solid dot overlay, followed by the standard deviation as a line overlay.

Fig. 3.7 shows the quartile charts for all four attributes of the Iris dataset plotted next to each other. Petal length can be observed as having the broadest range and sepal width a narrow range, out of all four attributes.

Figure 3.7. Quartile plot of Iris dataset.

One attribute can also be selected—petal length—and explored further using quartile charts by introducing a class label. In the plot in Fig. 3.8, we can see the distribution of the three species for the petal length measurement. Similar to the previous comparison, the distributions of the multiple species can be compared.

Figure 3.8. Class-stratified quartile plot of petal length in Iris dataset.
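Although Figs. 3.7 and 3.8 were produced in RapidMiner, a rough equivalent can be sketched in Python using matplotlib and scikit-learn's copy of the Iris dataset (an illustration, not the book's code).

# Side-by-side quartile (box whisker) plots of the four Iris attributes, and a
# class-stratified quartile plot of petal length.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

plt.figure()                                          # cf. Fig. 3.7
plt.boxplot([X[:, i] for i in range(X.shape[1])])
plt.xticks(range(1, 5), iris.feature_names, rotation=20)
plt.show()

petal_length = X[:, 2]                                # petal length is the third attribute
plt.figure()                                          # cf. Fig. 3.8
plt.boxplot([petal_length[y == k] for k in range(3)])
plt.xticks(range(1, 4), iris.target_names)
plt.show()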

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128147610000034