Statistical and probability models: Calculate central tendencies and dispersion of data

# Unit 1: Data collection

Dylan Busa

### Unit outcomes

By the end of this unit you will be able to:

- Define the types of data.
- Differentiate between grouped and ungrouped data.
- Create an ungrouped data frequency distribution and understand when to use it.
- Collect, organise and interpret univariate numerical data.

## What you should know:

Before you start this unit, make sure you can:

- Use a calculator to do simple arithmetic calculations.

## Introduction

Most great discoveries start with a question, a desire to know more about something. Very often, to answer this question, we need to collect data. But collecting data alone is not enough. We need to organise it so that we can analyse it, interpret it, draw conclusions from it and then present our findings to others. This is what statistics is all about.

## Statistics

So, statistics is about answering questions with data and can be applied to fields of study as varied as science, psychology, sport, art, economics, politics, and, of course, social media and technology. We all know that Google and Facebook are constantly collecting huge amounts of data about us by tracking our online habits. They then organise, analyse, and interpret this data to determine what adverts or stories and links they think we will like. In fact, social media is collecting so much data about us that we call it BIG DATA!

Statistics has been around for a long time, dating back to 8th century Arab mathematicians. The foundations of modern statistics were laid in the 17th century with the development of probability theory and it really came into its own in the late 19th and early 20th centuries.

Watch this video called “What is Statistics” for an excellent introduction to the topic.

### Did you know?

Data is the plural form of datum. Therefore, strictly speaking, it is not correct to say ‘The data is correct.’ Instead, you should say ‘The data **are** correct’ or ‘The datum **is** correct’. However, very few people follow this strict convention. What is definitely wrong, though, is saying ‘The datas are correct’, because **data** is already the plural.

Before we start learning more about statistics, it is important to realise that statistics can sometimes be used to deceive, manipulate and even outright lie. A saying often attributed to Benjamin Disraeli holds that “there are three kinds of lies: lies, damned lies, and statistics.” By studying statistics, you will be better able to spot when someone is trying to manipulate you or persuade you of something that is not actually true. To get an idea of how statistics can be used in these ways, watch the following two videos. The second one is quite a bit longer but well worth the time if you have it.

## Types of data

All statistics relies on data. Without data there is nothing to organise, analyse, interpret, or present. But what are data? Data are a collection of unorganised observations or records about people, places, things, events or anything else and can contain thousands to millions of entries.

Data can be of two main types. **Quantitative** data is **numerical** (it is represented as numbers), such as height, time, and cost. **Qualitative** data is **not numerical** and deals with descriptions and observations that cannot be measured, such as colour, appearance, and type.

Quantitative data can be split into **discrete** data where the values can only be **whole numbers** (like the number of ants in a nest) and **continuous** data where each value can be any **real number** (like the heights of students in a class).

To help you remember the difference, note that ‘quantitative’ is like ‘quantity’.

Types of data:

- **Quantitative data:** numerical data that can be measured.

  **Examples:** length, height, weight, time, cost, and number of people.

  - **Discrete data** can only take on whole number values.

    **Example:** number of females in a room.
  - **Continuous data** can have any real number value.

    **Example:** height of each male in the room.
- **Qualitative data:** things that can be observed but not measured.

  **Examples:** colours, sizes, tastes and appearance.
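The categories above can be turned into a rough rule of thumb. The short Python sketch below is only an illustration: the function name `classify` and the three sample lists are made up for this example and are not part of the unit. It also assumes that whole-number values mean discrete data, which is not always true (heights rounded to the nearest centimetre are still continuous measurements).

```python
def classify(values):
    """Roughly classify a data set using the definitions in this unit."""

    def is_number(value):
        # bool is a kind of int in Python, so we exclude it on purpose.
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    # If any entry is not a number, the data describes rather than measures.
    if not all(is_number(value) for value in values):
        return "qualitative"
    # Whole numbers only: treat as discrete (an assumption, see the note above).
    if all(float(value).is_integer() for value in values):
        return "quantitative (discrete)"
    return "quantitative (continuous)"


# Hypothetical sample data, invented for this illustration.
colours = ["red", "blue", "green"]
people_per_room = [3, 5, 2, 4]
heights_in_metres = [1.62, 1.75, 1.80]

print(classify(colours))
print(classify(people_per_room))
print(classify(heights_in_metres))
```

A rule like this is a starting point only. You still need to think about what was measured and how it was recorded.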

Statistics is divided into two sub-topics: **descriptive statistics** and **inferential statistics**.

- Descriptive statistics deals with actual data collected from or about a group of the people, places, events or things we are studying, called a **sample** (see Figure 1).
- Inferential statistics deals with the predictions and inferences we make about an entire population based on the data we collected from only a subgroup or sample of the population.

In this unit, we will only consider quantitative descriptive statistics. You will be working with numerical data from a sample of a whole population.

To learn more about the differences between descriptive and inferential statistics, watch the video called “Descriptive vs Inferential Statistics”.

### Example 1.1

Jeff wants to sell airtime vouchers to other students at his college. He surveys some students to find out how much data (in Mb) they used in the past week. Next, he asks each of these students which mobile network they use.

- Is the data collected on mobile data used quantitative or qualitative?
- Is the data collected on mobile networks used quantitative or qualitative?

*Solutions*

- The data collected on mobile data used is quantitative because the data values can be written as numbers.
- The data collected on mobile networks is qualitative because each response is not a number but the name of a company.

### Exercise 1.1

- The following data set is of the careers seven college students wish to pursue:

  ‘Electrician’, ‘Plumber’, ‘Fitter and turner’, ‘Hairdresser’, ‘Accounts clerk’, ‘Chef’, ‘Machine operator’.

  Is this data quantitative or qualitative?
- The following data set is of the number of nails in 8 packets bought from a supplier.

  [latex]\scriptsize 23;\text{ }25;\text{ }22;\text{ }26;\text{ }27;\text{ }25;\text{ }21;\text{ }28[/latex]

  Categorise the data as fully as possible.
- Categorise the following data set of sizes of online videos (in Mb) as fully as possible.

  [latex]\scriptsize 134.76,\ 674.52,\ 897.25,\ 789.82,\ 438.52,\ 863.86[/latex]

The full solutions are at the end of the unit.

## Organising data

Suppose a survey was conducted with [latex]\scriptsize 30[/latex] ladies to find out which is their favourite flower. Figure 2 shows the raw data that was collected from this survey.

How many ladies liked roses the most? How many ladies liked tulips the most? It is quite difficult to see, at a glance, what the data is telling us. To make sense of the data we need to condense it or organise it in ways that make analysis and interpretation easier. To do this, we can summarise the data in the form of a table as shown in Table 1.

A table like this is called a **frequency distribution**. Which representation (the raw data or the frequency distribution) do you think is better to get an overall idea of the data collected and to answer some basic questions?

### Take note!

**Frequency distribution**

Frequency is how often (or frequently) something occurs. A **frequency distribution**, also called a **frequency table**, is used to organise qualitative and quantitative raw data.

Hopefully you agree that the table is a better way to represent the data if we want to get a basic sense of what it is telling us. As we will see later on in this topic in Subject outcome 4.2, data can also be represented in graphs and charts to get a quick overall picture. Before you summarise data, it is important to understand how to organise data using **frequency distributions** like that shown in Table 1.

In this subject outcome, however, we will mostly look at frequency distributions of **ungrouped data** where there are no **intervals**. We will look at **grouped data frequency distributions** in Subject outcome 4.2.

Watch the video called “Intro to data handling” to see a simple example of how to organise raw ungrouped data into a frequency table.

As the size of the data set grows it becomes even more difficult to handle it in its raw form, so we almost always need to organise data before analysing it.

### Activity 1.1: Organise data using a frequency table

**Time required:** 10 minutes

**What you need:**

- a pen or pencil
- paper

**What to do:**

The following is a sample of raw data collected on the number of digital devices owned per household in a suburb.

Number of digital devices per household:

[latex]\scriptsize \begin{array}{*{20}{r}} 5 & 2 & 4 & 1 & 1 & 3 & 2 & 4 & 2 & 2 \\ 3 & 6 & 3 & 2 & 4 & 1 & 3 & 4 & 2 & 4 \\ 2 & 2 & 3 & 1 & 2 & 2 & 3 & 6 & 4 & 2 \\ 7 & 1 & 2 & 8 & 2 & 1 & 2 & 5 & 7 & 2 \\ 1 & 4 & 1 & 6 & 2 & 3 & 2 & 4 & 1 & 3 \end{array}[/latex]

- How many households in total were surveyed?
- Rewrite the data values from smallest to biggest. Can you group similar values together?
- How many households had two devices? How many had three devices? How many households had more than five devices?
- Create a table that shows the number of households with one device, two devices, three devices, etc.
- How many devices did most households have?
- The fibre provider will install fibre if at least [latex]\scriptsize 50\%[/latex] of households have three or more devices. Based on your frequency distribution, will the fibre provider install fibre in this suburb?

**What did you find?**

- By counting the number of entries, we see that [latex]\scriptsize 50[/latex] households were surveyed.
- Writing the entries from smallest to greatest gives us the following list.

  [latex]\scriptsize \begin{array}{l}1,\text{ }1,\text{ }1,\text{ }1,\text{ }1,\text{ }1,\text{ }1,\text{ }1,\text{ }1,\text{ }2,\text{ }2,\text{ }2,\text{ }2,\text{ }2,\text{ }2,\text{ }2,\text{ }2,\text{ }2,\text{ }2,\text{ }2,\text{ }2,\text{ }2,\text{ }2,\text{ }2,\text{ }2,\text{ }2,\\3,\text{ }3,\text{ }3,\text{ }3,\text{ }3,\text{ }3,\text{ }3,\text{ }3,\text{ }4,\text{ }4,\text{ }4,\text{ }4,\text{ }4,\text{ }4,\text{ }4,\text{ }4,\text{ }5,\text{ }5,\text{ }6,\text{ }6,\text{ }6,\text{ }7,\text{ }7,\text{ }8\end{array}[/latex]
- To count how many households had two devices, we need to count the number of ‘[latex]\scriptsize 2[/latex]’ entries in our list. When we do, we see that [latex]\scriptsize 17[/latex] households had two devices. Eight households had three devices. To work out how many households had more than five devices we need to add the households with six, seven and eight devices together. There were [latex]\scriptsize 3[/latex] households with six devices, [latex]\scriptsize 2[/latex] with seven and [latex]\scriptsize 1[/latex] with eight devices. Altogether there were [latex]\scriptsize 6[/latex] households with more than five devices.
- You can set up a basic table with two columns to record the number of households with different numbers of devices.

  | Number of devices in household | Frequency |
  | --- | --- |
  | [latex]\scriptsize 1[/latex] | [latex]\scriptsize 9[/latex] |
  | [latex]\scriptsize 2[/latex] | [latex]\scriptsize 17[/latex] |
  | [latex]\scriptsize 3[/latex] | [latex]\scriptsize 8[/latex] |
  | [latex]\scriptsize 4[/latex] | [latex]\scriptsize 8[/latex] |
  | [latex]\scriptsize 5[/latex] | [latex]\scriptsize 2[/latex] |
  | [latex]\scriptsize 6[/latex] | [latex]\scriptsize 3[/latex] |
  | [latex]\scriptsize 7[/latex] | [latex]\scriptsize 2[/latex] |
  | [latex]\scriptsize 8[/latex] | [latex]\scriptsize 1[/latex] |
  | Total | [latex]\scriptsize 50[/latex] |
- From the frequency table, it is easy to see that [latex]\scriptsize 2[/latex] is the value that is recorded the greatest number of times (it has the highest frequency), so we can say that the most common number of devices per household is two.
- Since there are [latex]\scriptsize 50[/latex] households, [latex]\scriptsize 25[/latex] or more must have three or more devices for the provider to install fibre. There are [latex]\scriptsize 8[/latex] households with three devices, [latex]\scriptsize 8[/latex] with four devices, [latex]\scriptsize 2[/latex] with five, [latex]\scriptsize 3[/latex] with six, [latex]\scriptsize 2[/latex] with seven and [latex]\scriptsize 1[/latex] with eight devices. Therefore, there is a total of [latex]\scriptsize 24[/latex] households with three or more devices. This is not quite enough for the fibre provider. Fewer than [latex]\scriptsize 50\%[/latex] of households have three or more devices, so they will not install fibre in this suburb.
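If you have access to a computer, the same counting can be checked with a few lines of Python. This is only a sketch of one possible approach, not a required method for this unit. The variable names are our own choices, and `Counter` comes from Python's standard `collections` module.

```python
from collections import Counter

# Raw data from Activity 1.1: number of digital devices per household.
devices = [
    5, 2, 4, 1, 1, 3, 2, 4, 2, 2,
    3, 6, 3, 2, 4, 1, 3, 4, 2, 4,
    2, 2, 3, 1, 2, 2, 3, 6, 4, 2,
    7, 1, 2, 8, 2, 1, 2, 5, 7, 2,
    1, 4, 1, 6, 2, 3, 2, 4, 1, 3,
]

# Count how often each value occurs: this is the frequency distribution.
frequency = Counter(devices)
total = sum(frequency.values())

print("Devices  Frequency")
for value in sorted(frequency):
    print(f"{value:>7}  {frequency[value]:>9}")
print(f"  Total  {total:>9}")

# The value with the highest frequency.
most_common_value, highest_frequency = frequency.most_common(1)[0]

# Households with three or more devices, as a count and as a percentage.
three_or_more = sum(count for value, count in frequency.items() if value >= 3)
percentage_three_or_more = 100 * three_or_more / total

print("Most common number of devices:", most_common_value)
print("Households with three or more devices:", three_or_more)
print("Percentage with three or more devices:", percentage_three_or_more)
```

The program follows the same steps as the written solution: count each value, add up the frequencies, then compare the share of households with three or more devices against the provider's [latex]\scriptsize 50\%[/latex] requirement.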

## Summary

In this unit you have learnt the following:

- Statistics is about collecting, organising, analysing, interpreting, and presenting data.
- Quantitative data is data that is numerical and can be measured like length, height, weight, time, cost, and number of items.
- Qualitative data is data that can be observed but not numerically measured like colour, appearance, and type.
- Quantitative data can be either discrete (whole numbers) or continuous (any real number).
- Descriptive statistics deals with data based on a subgroup of all the people, places, events or things we are studying, called a **sample**.
- Inferential statistics deals with observations about an entire population or group being studied.
- Ungrouped data is data in its raw form.
- Grouped data is data organised into intervals and usually presented in the form of a frequency table.
- Frequency is how often something occurs and a **frequency distribution**, also called a **frequency table**, is used to organise qualitative and quantitative raw data.

# Unit 1: Assessment

#### Suggested time to complete: 20 minutes

- The mathematics marks, out of [latex]\scriptsize 50[/latex], for a class of learners are given below:

  [latex]\scriptsize \begin{array}{l}46,\text{ }40,\text{ }12,\text{ }10,\text{ }47,\text{ }23,\text{ }26,\text{ }8,\text{ }29,\text{ }34,\text{ }37,\text{ }17,\text{ }40,\text{ }50,\text{ }18,\text{ }23,\text{ }33,\text{ }23,\\24,\text{ }15,\text{ }35,\text{ }23,\text{ }19,\text{ }22,\text{ }28,\text{ }35,\text{ }27,\text{ }42,\text{ }29,\text{ }26,\text{ }46,\text{ }33,\text{ }27,\text{ }19,\text{ }28\end{array}[/latex]

  - Complete the following frequency table using the above marks.

    | Interval of scores | Frequency |
    | --- | --- |
    | [latex]\scriptsize 0-10[/latex] | |
    | [latex]\scriptsize 11-20[/latex] | |
    | [latex]\scriptsize 21-30[/latex] | |
    | [latex]\scriptsize 31-40[/latex] | |
    | [latex]\scriptsize 41-50[/latex] | |
    | Total | |
  - How many learners wrote the test?
  - In which interval did most learners score?
  - If the pass mark was [latex]\scriptsize 21[/latex] out of [latex]\scriptsize 50[/latex], what percentage of learners passed the test?
  - What percentage of learners scored more than [latex]\scriptsize 80\%[/latex] in the test?
  - How many learners scored between [latex]\scriptsize 21[/latex] and [latex]\scriptsize 40[/latex] marks for the test?
- The employees of a small company were surveyed about their retirement savings. The following frequency distribution shows the numbers of years to retirement for the [latex]\scriptsize 101[/latex] employees in the company.

  | Years to retirement | Frequency |
  | --- | --- |
  | [latex]\scriptsize 10[/latex] | [latex]\scriptsize 2[/latex] |
  | [latex]\scriptsize 11[/latex] | [latex]\scriptsize 1[/latex] |
  | [latex]\scriptsize 12[/latex] | [latex]\scriptsize 2[/latex] |
  | [latex]\scriptsize 13[/latex] | [latex]\scriptsize 2[/latex] |
  | [latex]\scriptsize 14[/latex] | [latex]\scriptsize 3[/latex] |
  | [latex]\scriptsize 15[/latex] | [latex]\scriptsize 3[/latex] |
  | [latex]\scriptsize 16[/latex] | [latex]\scriptsize 3[/latex] |
  | [latex]\scriptsize 17[/latex] | [latex]\scriptsize 4[/latex] |
  | [latex]\scriptsize 18[/latex] | [latex]\scriptsize 6[/latex] |
  | [latex]\scriptsize 19[/latex] | [latex]\scriptsize 11[/latex] |
  | [latex]\scriptsize 20[/latex] | [latex]\scriptsize 2[/latex] |
  | [latex]\scriptsize 21[/latex] | [latex]\scriptsize 2[/latex] |
  | [latex]\scriptsize 22[/latex] | [latex]\scriptsize 10[/latex] |
  | [latex]\scriptsize 23[/latex] | [latex]\scriptsize 7[/latex] |
  | [latex]\scriptsize 24[/latex] | [latex]\scriptsize 7[/latex] |
  | [latex]\scriptsize 25[/latex] | [latex]\scriptsize 9[/latex] |
  | [latex]\scriptsize 26[/latex] | [latex]\scriptsize 5[/latex] |
  | [latex]\scriptsize 27[/latex] | [latex]\scriptsize 3[/latex] |
  | [latex]\scriptsize 28[/latex] | [latex]\scriptsize 6[/latex] |
  | [latex]\scriptsize 29[/latex] | [latex]\scriptsize 4[/latex] |
  | [latex]\scriptsize 30[/latex] | [latex]\scriptsize 9[/latex] |

  - How many employees will retire in less than [latex]\scriptsize 12[/latex] years?
  - If retirement age is [latex]\scriptsize 65[/latex], how many employees are [latex]\scriptsize 50[/latex] years old or older?
  - How many employees are younger than [latex]\scriptsize 40[/latex]?
  - What percentage of the total employees will retire in [latex]\scriptsize 20[/latex] years or more?
  - What percentage of the total employees will retire in [latex]\scriptsize 20[/latex] to [latex]\scriptsize 25[/latex] years’ time?
  - If you were to create a grouped frequency table of the data with seven groups, what size would each interval need to be?
  - Create a grouped frequency table for the data with intervals of three years.
  - From the grouped frequency table, is it possible to tell how many people will retire in [latex]\scriptsize 13[/latex] years or less? Why or why not?
  - From the grouped frequency table, how many employees are [latex]\scriptsize 40[/latex] years old or younger?

The full solutions are at the end of the unit.

# Unit 1: Solutions

### Exercise 1.1

- The data is qualitative data.
- The data is discrete quantitative data.
- The data is continuous quantitative data.

### Unit 1: Assessment

- Mathematics marks:

  - The completed frequency table:

    | Interval of scores | Frequency |
    | --- | --- |
    | [latex]\scriptsize 0-10[/latex] | [latex]\scriptsize 2[/latex] |
    | [latex]\scriptsize 11-20[/latex] | [latex]\scriptsize 6[/latex] |
    | [latex]\scriptsize 21-30[/latex] | [latex]\scriptsize 14[/latex] |
    | [latex]\scriptsize 31-40[/latex] | [latex]\scriptsize 8[/latex] |
    | [latex]\scriptsize 41-50[/latex] | [latex]\scriptsize 5[/latex] |
    | Total | [latex]\scriptsize 35[/latex] |
  - [latex]\scriptsize 35[/latex] learners wrote the test.
  - Most learners scored between [latex]\scriptsize 21[/latex] and [latex]\scriptsize 30[/latex] marks out of [latex]\scriptsize 50[/latex].
  - Total learners scoring [latex]\scriptsize 21[/latex] marks or more: [latex]\scriptsize 14+8+5=27[/latex].

    Percentage of total learners scoring [latex]\scriptsize 21[/latex] marks or more: [latex]\scriptsize \displaystyle \frac{{27}}{{35}}\times 100=77.1\%[/latex]
  - [latex]\scriptsize 80\%=\displaystyle \frac{{40}}{{50}}[/latex]. Five learners scored more than [latex]\scriptsize 40[/latex] marks.

    Percentage of total learners scoring more than [latex]\scriptsize 40[/latex] marks: [latex]\scriptsize \displaystyle \frac{5}{{35}}\times 100=14.3\%[/latex]
  - Total learners scoring between [latex]\scriptsize 21[/latex] and [latex]\scriptsize 40[/latex] marks: [latex]\scriptsize 14+8=22[/latex]
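The grouping and percentage calculations for the mathematics marks can also be checked with a short Python sketch. It is only one possible way to do the counting, and the variable names are our own. Each interval is treated as including both of its end values, as in the frequency table for this question.

```python
# Marks out of 50 from question 1 of the assessment.
marks = [
    46, 40, 12, 10, 47, 23, 26, 8, 29, 34, 37, 17, 40, 50, 18, 23, 33, 23,
    24, 15, 35, 23, 19, 22, 28, 35, 27, 42, 29, 26, 46, 33, 27, 19, 28,
]

# Intervals used in the frequency table (both end values included).
intervals = [(0, 10), (11, 20), (21, 30), (31, 40), (41, 50)]

# Count the marks that fall inside each interval.
grouped = {
    (low, high): sum(1 for mark in marks if low <= mark <= high)
    for low, high in intervals
}
total = len(marks)

for (low, high), count in grouped.items():
    print(f"{low}-{high}: {count}")
print("Total:", total)

# Learners who passed (21 marks or more) and who scored more than 80% (40 marks).
passed = sum(1 for mark in marks if mark >= 21)
above_80_percent = sum(1 for mark in marks if mark > 40)

pass_percentage = round(100 * passed / total, 1)
above_80_percentage = round(100 * above_80_percent / total, 1)

print("Percentage passed:", pass_percentage)
print("Percentage above 80%:", above_80_percentage)
```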

- Years to retirement:

  | Years to retirement | Frequency |
  | --- | --- |
  | [latex]\scriptsize 10[/latex] | [latex]\scriptsize 2[/latex] |
  | [latex]\scriptsize 11[/latex] | [latex]\scriptsize 1[/latex] |
  | [latex]\scriptsize 12[/latex] | [latex]\scriptsize 2[/latex] |
  | [latex]\scriptsize 13[/latex] | [latex]\scriptsize 2[/latex] |
  | [latex]\scriptsize 14[/latex] | [latex]\scriptsize 3[/latex] |
  | [latex]\scriptsize 15[/latex] | [latex]\scriptsize 3[/latex] |
  | [latex]\scriptsize 16[/latex] | [latex]\scriptsize 3[/latex] |
  | [latex]\scriptsize 17[/latex] | [latex]\scriptsize 4[/latex] |
  | [latex]\scriptsize 18[/latex] | [latex]\scriptsize 6[/latex] |
  | [latex]\scriptsize 19[/latex] | [latex]\scriptsize 11[/latex] |
  | [latex]\scriptsize 20[/latex] | [latex]\scriptsize 2[/latex] |
  | [latex]\scriptsize 21[/latex] | [latex]\scriptsize 2[/latex] |
  | [latex]\scriptsize 22[/latex] | [latex]\scriptsize 10[/latex] |
  | [latex]\scriptsize 23[/latex] | [latex]\scriptsize 7[/latex] |
  | [latex]\scriptsize 24[/latex] | [latex]\scriptsize 7[/latex] |
  | [latex]\scriptsize 25[/latex] | [latex]\scriptsize 9[/latex] |
  | [latex]\scriptsize 26[/latex] | [latex]\scriptsize 5[/latex] |
  | [latex]\scriptsize 27[/latex] | [latex]\scriptsize 3[/latex] |
  | [latex]\scriptsize 28[/latex] | [latex]\scriptsize 6[/latex] |
  | [latex]\scriptsize 29[/latex] | [latex]\scriptsize 4[/latex] |
  | [latex]\scriptsize 30[/latex] | [latex]\scriptsize 9[/latex] |

  - Employees retiring in less than [latex]\scriptsize 12[/latex] years are those retiring in [latex]\scriptsize 11[/latex] or [latex]\scriptsize 10[/latex] years: [latex]\scriptsize 1+2=3[/latex]
  - Employees aged [latex]\scriptsize 50[/latex] would retire in [latex]\scriptsize 15[/latex] years. So total employees aged [latex]\scriptsize 50[/latex] or older are those that will retire in [latex]\scriptsize 15[/latex] years or less: [latex]\scriptsize 3+3+2+2+1+2=13[/latex] employees.
  - Employees younger than [latex]\scriptsize 40[/latex] would retire in more than [latex]\scriptsize 25[/latex] years i.e. [latex]\scriptsize 26[/latex] years or more: [latex]\scriptsize 5+3+6+4+9=27[/latex] employees.
  - Employees retiring in [latex]\scriptsize 20[/latex] years or more: [latex]\scriptsize 2+2+10+7+7+9+5+3+6+4+9=64[/latex]

    Percentage of total employees: [latex]\scriptsize \displaystyle \frac{{64}}{{101}}\times 100=63.4\%[/latex]
  - Employees retiring in [latex]\scriptsize 20[/latex] to [latex]\scriptsize 25[/latex] years’ time: [latex]\scriptsize 2+2+10+7+7+9=37[/latex]

    Percentage of total employees: [latex]\scriptsize \displaystyle \frac{{37}}{{101}}\times 100=36.6\%[/latex]
  - There are [latex]\scriptsize 21[/latex] different values in the table (from [latex]\scriptsize 10[/latex] to [latex]\scriptsize 30[/latex] years). Therefore, each of the seven groups would need to cover three years, because [latex]\scriptsize 21\div 7=3[/latex].
  - A grouped frequency table for the data with intervals of three years:

    | Years to retirement | Frequency |
    | --- | --- |
    | [latex]\scriptsize 10-12[/latex] | [latex]\scriptsize 5[/latex] |
    | [latex]\scriptsize 13-15[/latex] | [latex]\scriptsize 8[/latex] |
    | [latex]\scriptsize 16-18[/latex] | [latex]\scriptsize 13[/latex] |
    | [latex]\scriptsize 19-21[/latex] | [latex]\scriptsize 15[/latex] |
    | [latex]\scriptsize 22-24[/latex] | [latex]\scriptsize 24[/latex] |
    | [latex]\scriptsize 25-27[/latex] | [latex]\scriptsize 17[/latex] |
    | [latex]\scriptsize 28-30[/latex] | [latex]\scriptsize 19[/latex] |
  - No. We can only tell how many people will retire in [latex]\scriptsize 12[/latex] years or less. People retiring in [latex]\scriptsize 13[/latex] years are part of the [latex]\scriptsize 13-15[/latex] group, so we do not know from the grouped data exactly how many people this is.
  - Employees who are [latex]\scriptsize 40[/latex] years old or younger will retire in [latex]\scriptsize 25[/latex] years or more:

    [latex]\scriptsize 17+19=36[/latex]
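To see how an ungrouped frequency table turns into a grouped one, the retirement data can be regrouped with a short Python sketch. This is an illustration only. The function name `group_frequencies` and the variable names are made up for this example, and the intervals are assumed to start at the smallest value in the table and to have equal widths.

```python
# Ungrouped frequency table from question 2: years to retirement -> frequency.
years_to_retirement = {
    10: 2, 11: 1, 12: 2, 13: 2, 14: 3, 15: 3, 16: 3,
    17: 4, 18: 6, 19: 11, 20: 2, 21: 2, 22: 10, 23: 7,
    24: 7, 25: 9, 26: 5, 27: 3, 28: 6, 29: 4, 30: 9,
}


def group_frequencies(frequency, width):
    """Combine an ungrouped frequency table into equal-width intervals."""
    start = min(frequency)
    grouped = {}
    for value, count in frequency.items():
        # Find the lower end of the interval that this value belongs to.
        low = start + ((value - start) // width) * width
        interval = (low, low + width - 1)
        grouped[interval] = grouped.get(interval, 0) + count
    return grouped


grouped = group_frequencies(years_to_retirement, width=3)
total = sum(years_to_retirement.values())

for (low, high), count in sorted(grouped.items()):
    print(f"{low}-{high}: {count}")
print("Total:", total)

# Percentage of employees retiring in 20 years or more (ungrouped table).
twenty_or_more = sum(
    count for years, count in years_to_retirement.items() if years >= 20
)
percentage_twenty_or_more = round(100 * twenty_or_more / total, 1)
print("Percentage retiring in 20 years or more:", percentage_twenty_or_more)
```

Notice that the percentage is worked out from the ungrouped table. Once the data has been grouped, the individual values inside each interval can no longer be separated, which is why some questions cannot be answered from the grouped table alone.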

### Media Attributions

- figure1 © DHET is licensed under a CC BY (Attribution) license
- figure2 © DHET is licensed under a CC BY (Attribution) license
- table1 © DHET is licensed under a CC BY (Attribution) license