
When Counting Gets Difficult, Part 2

Prion sp., March 22, 2016, seen just west of Heard Island. Image credit: Bill Mitchell.

Earlier I posed a question: suppose a flock of 40 birds is identified only to genus level (prion sp.). Four photographs of random birds from the flock are later identified to species level, and all four show a species that was expected to be in the minority (fulmar prion) and that would likely occur in mixed flocks. How many of the 40 birds should be upgraded from a genus-level to a species-level ID?

Clearly there is a minimum of one fulmar prion present, because one was identified in the photographs. With four photographs and 40 birds, the chance of randomly catching the same bird all four times is quite small, so the number of fulmar prions is probably much higher than one. At the same time, it would not be reasonable, from a sample of only four photographs, to conclude that all 40 were fulmar prions.
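As a quick sanity check, assuming each photograph is an independent, uniformly random draw from the flock, that chance is easy to put a number on:

# Probability that all four photos catch the same one of 40 birds:
# the first photo fixes the bird; each of the other three matches it with p = 1/40
(1/40)^3  # about 1.6e-05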

If we have four photographs of fulmar prions (A), what is the minimum number of non-fulmar prions (B) needed in a 40-prion flock to have a 95% chance of photographing at least one non-fulmar prion?

To answer this question, I used a Monte Carlo simulation, which I wrote in R. I generated 40-element populations of A and B, ranging from all A to all B. Then for each of those populations, I ran 100,000 trials, sampling 4 random birds from each population (with replacement). By tracking the proportion of trials for each population that contained at least one B, it becomes possible to find the 95% confidence limit.

pop_size <- 40  # Set the population size
sample_n <- 4  # Set the number of samples (photographs)
n_trials <- 100000  # Set the number of trials for each population

x <- 0:pop_size  # Create a vector of the numbers from 0 to pop_size (i.e. how many B in population)

sample_from_pop <- function(population, sample_size, n_trials){
	# Run Monte Carlo sampling, taking sample_size samples (with replacement)
	# from population (vector of TRUE/FALSE), repeating n_trials times
	# population: vector of TRUE/FALSE representing e.g. species A (TRUE) and B (FALSE)
	# sample_size: the number of members of the population to inspect
	# n_trials: the number of times to repeat the sampling
	my_count <- 0
	for(k in 1:n_trials){  # Repeat sampling n_trials times
		my_results <- sample(population, sample_size, replace=TRUE)  # Get the samples
		if(FALSE %in% my_results){  # Look for whether it had species B
			my_count <- my_count + 1  # Add one to the count if it did
		}
	}
	return(my_count/n_trials)  # Return the proportion of trials detecting species B
}

create_pop <- function(n,N){  # Make the populations
	return(append(rep(TRUE,N-n),rep(FALSE,n)))  # Populations have N-n repetitions of TRUE (sp. A), n reps of FALSE (sp. B)
}

mypops <- lapply(0:pop_size, create_pop, pop_size)  # Create populations for sampling

# Apply the sampling function to the populations, recording the proportion of trials sampling at least one of species B
my_percentages <- sapply(mypops, sample_from_pop, sample_size=sample_n, n_trials=n_trials)

My simulation results showed that with 22 or more birds of species B (non-fulmar prions) in the flock, there was a >95% chance that at least one would be detected. In other words, from my photographic data, there is a 95% probability that the flock of 40 prions contained no fewer than 19 fulmar prions.
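Because the sampling here is with replacement, this particular case also has a closed form: with b birds of species B, the chance that none of the 4 photographs shows one is ((40 − b)/40)⁴, so the detection probability is 1 minus that. A quick cross-check of the simulation:

b <- 0:40  # possible numbers of species B in the flock
p_detect <- 1 - ((40 - b)/40)^4  # chance at least one of 4 photos shows species B
min(b[p_detect >= 0.95])  # smallest number of B clearing 95%: 22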

Let’s take a look at it graphically.

library(ggplot2)

mydata <- data.frame(my_percentages, 0:pop_size)  # Make a data.frame with the results and the # of species B
names(mydata) <- c("DetProb", "B")  # Rename the columns to something friendly and vaguely descriptive

p <- ggplot(mydata, aes(x=B,y=DetProb)) + geom_point() # Create the basic ggplot2 scatterplot
p <- p + geom_hline(yintercept=0.95)  # Add a horizontal line at 95%
p <- p + theme_bw() + labs(x="# of species B (pop. 40)", y="Detection probability of B")  # Tidy up the presentation and labeling
print(p)  # Display it!
Results of the Monte Carlo simulation. At left is all A, while at right is a population with all B. The horizontal line is the 95% probability line. Points above the line have a >95% chance of detecting species B.

With 22 or more non-fulmar prions, there’s a >95% chance one would be photographed. With 19 fulmar prions and 21 non-fulmar prions, there’s a >5% chance the non-fulmar prions would be missed. So our minimum number of fulmar prions is 19. I may have seen a flock of 40 fulmar prions, but there aren’t enough observations to say with statistical confidence that they were all fulmar prions.


Science on a Plane

Temperature profile flying in to MSP around 2120 UTC on April 25, 2016. Image credit: Bill Mitchell (CC-BY).

One of my favorite things to do on an airplane, when I can, is to take a temperature profile during the descent. Until recently, this could generally only be done on long international flights, where seat-back screens showed altitude and temperature along with other flight data. On my latest trip, however, I found that even some domestic flights now present this information in a nice tabular form.

To take a temperature profile, when the captain announces that the descent is beginning, get out your notebook and set your screen to the flight information, which will hopefully show altitude (m) and temperature (°C). Record the altitude and temperature as often as they update on the way down, though you might set a minimum altitude change (say, 20 m) to avoid recording lots of identical points if the plane levels off for a while. When you land, be sure to note the time, date, and location of arrival.

When you get a chance, transfer the data to a CSV (comma-separated value) file, including the column headers like in the example below.


Alt (m),Temp (C)
10325,-52
10138,-51
9992,-48
...
250,17

You can then use your favorite plotting program (I like R with ggplot) to plot up the data. I’ve included my R script for plotting at the bottom of the page. Just adjust the filename for infile, and it should do the rest for you.

At the top of the page is the profile I took on my way in to Minneapolis on the afternoon of April 25th. There were storms in the area, and we see a clear inversion layer (warmer air above than below) about 1 km up, with a smaller inversion at 1.6 km. From the linear regression, the temperature fell by an average of 6.44 °C/km (regression slope −6.44 °C/km), a slightly lower lapse rate than the typical value of about 7 °C/km.

On the way in to Los Angeles the morning of April 25th, no strong inversion layer was present and temperature increased to the ground.

Temperature profile descending into Los Angeles on the morning of April 25, 2016. Image credit: Bill Mitchell (CC-BY).

This is a pretty easy way to do a little bit of science while you’re on the plane, and to practice your plotting skills when you’re on the ground. For comparison, the University of Wyoming has records of weather balloon profiles from around the world. You can plot them yourself from the “Text: List” data, or use the “GIF: to 10mb” option to have it plotted for you.

Here is the plotting script.


# Script for plotting Alt/Temp profile
# File in format Alt (m),Temp (C)

infile <- "20160425_MSP_profile.csv" # Name of CSV file for plotting

library(ggplot2) # Needed for plotting
library(tools) # Needed for removing file extension to automate output filename

mydata <- read.csv(infile) # Import data
mydata[,1] <- mydata[,1]/1000 # convert m to km
mystats <- lm(mydata[,2]~mydata[,1]) # Run linear regression to get lapse rate
myslope <- mystats$coefficients[2] # Slope of regression
myint <- mystats$coefficients[1] # Intercept of regression

p <- ggplot(mydata, aes(x=mydata[,2], y=mydata[,1])) # Temperature on x, altitude on y
p <- p + stat_smooth(method="lm", color="blue") # Add the linear fit with confidence band
p <- p + geom_point() # Plot the data points
p <- p + labs(x="Temp (C)", y="Altitude (km)") # Label the axes
p <- p + annotate("text", x=-30, y=1, label=sprintf("y=%.2fx + %.2f",myslope,myint)) # Print the regression equation on the plot
p <- p + theme_classic() # Tidy up the presentation

png(file=paste(file_path_sans_ext(infile),"png",sep="."), width=800, height=800) # Set output image info
print(p) # Plot it!
dev.off() # Done plotting

Addressing Uncertainty Numerically

Casino Monte Carlo, Monaco. Image credit: Positiv (CC-BY-SA).

Recently I wrote about what scientific uncertainty is, and why it might be important. There are a few things which can be done to understand the “fuzziness” of a number. However, the real world can get complicated in a way which most introductory courses don’t deal with.

For instance, a standard introduction would help you through what to do if you are adding or multiplying two numbers. When adding, the absolute uncertainties combine (in quadrature, for independent errors); when multiplying, it is the relative uncertainties that combine. This can be a little bit of math, but in general it’s fairly straightforward: plug in the numbers and out pops an answer.
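As a numeric illustration of those rules for two independent quantities (the values here are invented for the example):

x <- 10; dx <- 0.3  # x = 10 +/- 0.3 (illustrative values)
y <- 5;  dy <- 0.2  # y = 5 +/- 0.2
dz_sum <- sqrt(dx^2 + dy^2)  # uncertainty of z = x + y
dz_prod <- (x*y) * sqrt((dx/x)^2 + (dy/y)^2)  # uncertainty of z = x * y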

Sometimes, though, figuring out the proper formula can be time-consuming or nearly impossible. In these cases, there is still a way to get a satisfactory answer: a Monte Carlo simulation, so named for the iconic casino in Monaco (pictured above).

The Monte Carlo simulation is effectively an exercise in rolling dice. Rather than handle the uncertainties analytically, random numbers are used (typically a Gaussian distribution of appropriate mean and width) and combined together in the equation in question, with no uncertainties attached. After a large number of repetitions (for my purposes around 10⁴–10⁷), the statistical distribution can be evaluated. In many cases that evaluation means taking a mean and standard deviation, although it is very informative to look at the distribution to be sure it is at least roughly Gaussian. If the distribution is bimodal or highly skewed, a standard deviation may not be an appropriate metric.
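Here is a minimal sketch of that procedure in R for the product z = x·y, reusing the illustrative values from above; the standard deviation of z can be compared against the propagated value dz_prod:

n <- 1e5  # number of repetitions
x_draws <- rnorm(n, mean=10, sd=0.3)  # Gaussian draws for x = 10 +/- 0.3
y_draws <- rnorm(n, mean=5, sd=0.2)  # Gaussian draws for y = 5 +/- 0.2
z_draws <- x_draws * y_draws  # combine in the equation, no uncertainties attached
mean(z_draws)  # central value
sd(z_draws)  # spread; compare to dz_prod above
hist(z_draws)  # check that the distribution is roughly Gaussian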

There is another place where Monte Carlo simulations are useful. Suppose you want to know the average (and distribution) produced when you roll four 6-sided dice, and sum the three highest. Is there a way to solve it analytically? Probably, but I can code (and run) the Monte Carlo simulation much more quickly.*

Here’s how to do that in the statistics program R:

my_repetitions <- 10000  # Set the number of trials
# Use a variable for this so when you change N, you only have to do so once.
# Tracking down every place you entered 10000 by hand is no fun.

my_roll <- function(){  # Declare a function to roll four dice and sum the highest three
	# returns: sum of highest three of four six-sided dice rolls (integer)
	roll_four <- sample(1:6, size=4, replace=TRUE)  # Create a four-element vector representing four 6-sided dice
	return(sum(roll_four) - min(roll_four))  # Sum the four dice, and subtract the minimum roll
}

my_results <- replicate(my_repetitions, my_roll())  # Create a vector of length my_repetitions, with each element the result of my_roll()

summary(my_results)  # Show a statistical summary of the results
hist(my_results)  # Plot a quick-and-dirty histogram of the results

Monte Carlo results (N=10,000) for rolling four 6-sided dice and summing the highest three.

Monte Carlo simulations are very handy when there are non-linear interactions between parameters with uncertainties. There may be no analytical solution, but creating a Monte Carlo procedure for determining the final distribution is relatively straightforward. In fact, this is what many weather and climate models do: an ensemble of different numerical predictions is put together, and the result gives a statistical likelihood of different events. With weather models, the trick is to keep the code running quickly enough that the result is still a prediction when it finishes; predicting what the weather would have been like a week ago is not useful in the way that knowing what tomorrow may have in store is.

This post is not meant to be an introduction to R (which I used), or to good programming practices (which I hope I used). Introductions to both are available online, or at colleges and universities in facilities like UC Berkeley’s D-Lab. I may also write about these topics at a later point.

* For those keeping score at home, the mean of this dice-rolling scheme is 12.25 and the median is 12.00, with the middle 50% ranging from 10 to 14.
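For the truly curious, those values can be checked exactly by enumerating all 6⁴ = 1296 equally likely outcomes, which agrees with the simulation to within sampling error:

rolls <- expand.grid(1:6, 1:6, 1:6, 1:6)  # every possible set of four dice
totals <- apply(rolls, 1, function(r) sum(r) - min(r))  # highest-three sum for each
mean(totals)  # exact mean, about 12.24
quantile(totals)  # exact median and quartiles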