Handout 1: Basic Definitions, Samples and Populations, and Sampling Methods
Reading Assignment: Sections 1.1, 1.2, Chapter 2

Statistics v. statistics

Statistics (capital ‘S’) is a collection of techniques and procedures for analyzing data. These techniques and procedures are used to help people make decisions when faced with uncertainty. The numbers we calculate from data to make summaries, estimations, predictions, etc. are statistics (lower-case ‘s’). Think of statistics you have calculated or seen before and list them below (for instance batting average).

Many see Statistics as a Mathematics course, but it is important to understand the difference. While mathematical concepts are used, Statistics is a distinct scientific field. We use math to make sense of data and to draw meaningful conclusions from it.

Data

Any characteristic that can differ from one individual to the next is called a variable. Variables that are measured, or otherwise determined, and collected on a number of individuals are called data. Often, we organize data into a dataset, a row-and-column display (think spreadsheet). Sometimes, the individuals are called subjects or observations to give them a more specific label: we might call an individual person a subject and an individual factory an observation. The figure below displays data on ten subjects (students) who filled out the ‘Getting to Know You’ Survey, organized in a Microsoft Excel spreadsheet.

Variables are one of two main types: categorical or numerical. Each of these types can be separated further, as seen in the figure below. Data that consist of groups are known as categorical data, while data that measure a ‘quantity’ (e.g. how much, how many) are numerical data. Sometimes categorical data are coded as numbers (e.g. male = 0 and female = 1), but the data are still categorical. What level of measurement is each variable in the Excel spreadsheet?


Also, place additional examples of the measurement levels of data in the figure below.

[Figure: levels of measurement. Categorical: Nominal, Ordinal. Numerical: Interval, Ratio.]
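To illustrate why number-coded categories are still categorical, here is a minimal Python sketch. The ten responses are made up for illustration (they are not the survey data), and the 0/1 coding follows the example in the text. Counting individuals per category gives the same answer under any coding, while arithmetic on the codes changes as soon as the arbitrary labels change.

```python
from collections import Counter
from statistics import mean

# Hypothetical responses, coded as in the text: male = 0, female = 1.
sex_coded = [0, 1, 1, 0, 1, 1, 0, 1, 1, 1]
labels = {0: "male", 1: "female"}

def category_counts(codes, code_to_label):
    """Count how many individuals fall in each category."""
    return dict(Counter(code_to_label[c] for c in codes))

# A different but equally valid coding: male = 1, female = 2.
sex_recoded = [c + 1 for c in sex_coded]
relabels = {1: "male", 2: "female"}

counts = category_counts(sex_coded, labels)
recounts = category_counts(sex_recoded, relabels)

print(counts == recounts)                    # the counts do not depend on the coding
print(mean(sex_coded), mean(sex_recoded))    # the "average code" does
```

The counts are the meaningful summary here; the average of the codes only reflects which numbers happened to be chosen as labels.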

Population v. Sample

Consider the figure of the first ten subjects from the ‘Getting to Know You’ Survey. Let’s treat this as an entire class of ten STAT 2160 students. If we were only interested in using these measurements to describe the students in the class the data come from, we would have population data. On the other hand, if we wanted to use these measurements to describe a larger collection of students (e.g., students from all sections of STAT 2160 during the semester the data were collected), we would have sample data. Usually, the methods used to summarize data are the same for populations and samples. When the methods differ, we will point out the difference. A large focus of this course is making inferences about a larger population using a sample, which is an important reason to determine whether data come from a sample or a population.

Parameter v. statistic

Recall that statistics are the numbers we calculate from data to make summaries. Now that we know the difference between a sample and a population, we can refine this definition: statistics are calculated from sample data, and parameters are the numbers we calculate from population data. An easy way to remember this distinction is that population and parameter both begin with ‘p’, while sample and statistic both begin with ‘s’. Both kinds of summaries can be classified as descriptive statistics, which we will look at in more depth later.
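The distinction can be sketched in a few lines of Python. The ten values below are hypothetical (hours of sleep for a class of ten students), and the sample size and seed are arbitrary choices made only so the example can be rerun. The same formula, the mean, yields a parameter when applied to the whole population and a statistic when applied to a sample.

```python
import random
from statistics import mean

# Hypothetical population: hours of sleep for an entire class of ten students.
population = [6.5, 7.0, 8.0, 5.5, 7.5, 6.0, 9.0, 7.0, 6.5, 8.5]

# Parameter: a summary computed from every member of the population.
population_mean = mean(population)

# Statistic: the same summary computed from a sample of the population.
rng = random.Random(2160)           # fixed seed, an arbitrary choice
sample = rng.sample(population, 4)  # four students chosen at random
sample_mean = mean(sample)

print("parameter (population mean):", population_mean)
print("statistic (sample mean):   ", sample_mean)
```

The parameter is a fixed number for this population, while the statistic depends on which students happen to be drawn.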

Sampling Methods

The goal of sampling is to obtain a sample that is unbiased, that is, representative of the population. A probability sampling method is one in which every unit in the population has a known, nonzero chance of being selected in the sample. These methods may be costly and time consuming, but they are the only way to ensure the sample is unbiased. We will focus on the simple random sample in this class, but will also discuss other sampling methods. For a more in-depth look at these sampling methods, please refer to the course pack.

Simple Random Sample - Each observation has the same probability of being chosen at each ‘draw’.
K-in-1 Systematic Sample - Select every kth subject on a list or sequence.
Stratified Random Sample - Partition the population into subsets, called strata, and select a simple random sample from each stratum.
Cluster Sample - Randomly select clusters, then observe every unit in each selected cluster.
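The four probability sampling methods can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the class-rank strata, and the flight rosters are hypothetical, and the random starting point in the systematic sample is a common convention rather than something the definition above requires.

```python
import random

def simple_random_sample(units, n, rng):
    """Draw n distinct units, each equally likely at every draw."""
    return rng.sample(units, n)

def systematic_sample(units, k, rng):
    """Take every kth unit, starting from a random position (an assumption)."""
    start = rng.randrange(k)
    return units[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample of the same size from each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters, then keep every unit in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# A class of 200 students numbered 1 to 200.
students = list(range(1, 201))

# Hypothetical strata: class rank assigned by student number.
strata = {
    "freshman": students[:50],
    "sophomore": students[50:100],
    "junior": students[100:150],
    "senior": students[150:],
}

# Hypothetical clusters: passengers grouped by flight.
flights = {
    "flight_A": ["A1", "A2", "A3"],
    "flight_B": ["B1", "B2"],
    "flight_C": ["C1", "C2", "C3", "C4"],
}

srs = simple_random_sample(students, 60, rng)
every_tenth = systematic_sample(students, 10, rng)
by_rank = stratified_sample(strata, 15, rng)
one_flight = cluster_sample(flights, 1, rng)

print(len(srs), len(every_tenth), {r: len(s) for r, s in by_rank.items()})
print(one_flight)
```

Note the difference between the last two: a stratified sample takes some units from every group, while a cluster sample takes every unit from some groups.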


Informal and less scientific methods are known as nonprobability samples. Some examples of nonprobability samples follow:

Voluntary Response - Subjects choose whether or not to respond.
Convenience - Researcher decides which subjects are chosen.

Can you think of examples of these nonprobability sampling methods you have encountered in your life? Can we trust the results from these types of methods?

These methods are easy ways of collecting data, but they are typically biased and scientifically suspect.

If a population is small enough, it may make sense to collect data on the whole population instead of a sample. This method is called a census. For some populations, this is not feasible or is impossible. Will you test every can of soda for caloric content? Can you test every can?


iClicker Questions

In each of the following situations, indicate whether the sampling method is (a) simple random sampling, (b) k-in-1 systematic sampling, (c) stratified random sampling, or (d) cluster sampling.

1. A class of 200 students is numbered from 1 to 200, and 60 students are randomly chosen from the class.
(a) simple random sampling (b) k-in-1 systematic sampling (c) stratified random sampling (d) cluster sampling

2. In a class of 200 students, the students have different class rankings (freshman, sophomore, junior, and senior). A random sample from each class ranking is taken.
(a) simple random sampling (b) k-in-1 systematic sampling (c) stratified random sampling (d) cluster sampling

3. An airline company randomly chooses one flight from a list of all flights taking place that day. All passengers on that selected flight are asked to fill out a survey on meal satisfaction.
(a) simple random sampling (b) k-in-1 systematic sampling (c) stratified random sampling (d) cluster sampling

4. In a factory producing television sets, every 100th set produced is inspected.
(a) simple random sampling (b) k-in-1 systematic sampling (c) stratified random sampling (d) cluster sampling

Source: Utts and Heckard. Mind on Statistics. 3rd ed. United States: Duxbury, 2007. 107. Print.


Types of Surveys

Face-to-face interview
Phone interview
Written Questionnaire
Web-based Questionnaire (espn.go.com)
900 numbers

What are some other ways surveys are taken or data is collected?

Survey Errors

As we will see later, sample outcomes are only estimates of the truth in the population. Two random samples will contain different individuals, so their results will differ somewhat. These differences are expected; they are known as sampling error, or sampling variability, and they are not a bad thing. There are, however, ways of sampling that can lead to bias:

Coverage errors - The sampling frame, the ‘list’ from which you are selecting your sample, excludes some segment of the target population.
Nonresponse errors - Lack of response from certain segments of the target population.
Measurement errors - Respondents answer ‘inaccurately’ because of question wording, question ordering, interviewer effect, or other external influences.
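The difference between sampling variability and bias from a coverage error can be seen in a small simulation. This is a sketch with made-up numbers: a hypothetical population of 1,000 people in two equal segments, one of which is missing from the sampling frame. The segment sizes, support rates, sample size, and number of repetitions are all assumptions chosen for illustration.

```python
import random
from statistics import mean

rng = random.Random(2160)  # fixed seed, an arbitrary choice

# Hypothetical population: 1 = supports a proposal, 0 = does not.
segment_on_frame = [1] * 350 + [0] * 150   # 500 people, 70% support
segment_off_frame = [1] * 150 + [0] * 350  # 500 people, 30% support
population = segment_on_frame + segment_off_frame

true_proportion = mean(population)         # 0.5 for these made-up numbers

def repeated_sample_proportions(frame, n, repeats, rng):
    """Draw many simple random samples from a frame; return each sample proportion."""
    return [mean(rng.sample(frame, n)) for _ in range(repeats)]

# Good frame: every member of the target population can be selected.
good = repeated_sample_proportions(population, 100, 500, rng)

# Coverage error: the frame leaves out an entire segment of the population.
flawed = repeated_sample_proportions(segment_on_frame, 100, 500, rng)

print("true proportion:           ", true_proportion)
print("good frame, average result:", round(mean(good), 3))
print("good frame, range:         ", min(good), "to", max(good))
print("flawed frame, average:     ", round(mean(flawed), 3))
```

With the good frame, individual samples differ from one another but are centered on the true proportion; that spread is sampling variability. With the flawed frame, the samples are centered on the wrong value, and taking more or larger samples from the same frame does not fix it.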

Source: deweydefeatstruman.com

Despite his success in 1936, George Gallup failed miserably in trying to predict the winner of the 1948 U.S. presidential election. His organization, as well as two others, predicted that Thomas Dewey would beat incumbent Harry Truman. All three used what is called quota sampling. The interviewers were told to find a certain number, or quota, of each of several types of people. For example, they might have been told to interview six women under age 40, one of whom was black and the other five of whom were white. Why do you think these polls failed to predict the true winner?


iClicker Question

In each of the following situations, indicate whether the potential bias is (a) coverage error, (b) nonresponse error, or (c) measurement error.

1. A survey question asked of unmarried men was, ‘What is the most important feature you consider when deciding whether to date somebody?’ The results were found to depend on whether the interviewer was male or female.
(a) coverage error (b) nonresponse error (c) measurement error

2. In a study of women’s opinions about community issues, investigators randomly selected a sample of households and interviewed a woman from each selected household. When no woman was present in a selected household, a next-door neighbor was interviewed instead. The survey was done during daytime hours, so working women might have been disproportionately missed.
(a) coverage error (b) nonresponse error (c) measurement error

3. A telephone survey of 500 residences is conducted. People refused to talk to the interviewer in 200 of the residences.
(a) coverage error (b) nonresponse error (c) measurement error

Source: Utts and Heckard. Mind on Statistics. 3rd ed. United States: Duxbury, 2007. 107. Print.

