Tag Archives: Programming

When Counting Gets Difficult, Part 2

Prion sp., March 22, 2016, seen just west of Heard Island. Image credit: Bill Mitchell.

Earlier I posed a question: suppose a group of 40 birds is identified to genus level (prion sp.). Four photographs of random birds are identified to species level, all of one species (fulmar prion) that was expected to be in the minority and likely present in mixed flocks. How many of the 40 birds should be upgraded from a genus-level ID to a species-level ID?

Clearly there is a minimum of one fulmar prion present, because one was identified in the photographs. With four photographs and 40 birds, the chance of randomly catching the same bird all four times is quite small (roughly one in 64,000 if each photo catches a random bird), so the number of fulmar prions is probably much higher than one. At the same time, it would not be reasonable, from a sample of only four photographs, to say all 40 were fulmar prions.

If we have four photographs of fulmar prions (A), what is the minimum number of non-fulmar prions (B) needed in a 40-prion flock to have a 95% chance of photographing at least one non-fulmar prion?

To answer this question, I used a Monte Carlo simulation, which I wrote in R. I generated 40-element populations of A and B ranging from all A to all B. Then for each of those populations, I ran 100,000 trials, sampling 4 random birds from each population (with replacement). By tracking the proportion of trials for each population that had at least one B, it becomes possible to find the 95% confidence limit.

pop_size <- 40  # Set the population size
sample_n <- 4  # Set the number of samples (photographs)
n_trials <- 100000  # Set the number of trials for each population

x <- 0:pop_size  # Create a vector of the numbers from 0 to pop_size (i.e. how many B in population)

sample_from_pop <- function(population, sample_size, n_trials){
	# Run Monte Carlo sampling, taking sample_size samples (with replacement)
	# from population (vector of TRUE/FALSE), repeating n_trials times
	# population: vector of TRUE/FALSE representing e.g. species A (TRUE) and B (FALSE)
	# sample_size: the number of members of the population to inspect
	# n_trials: the number of times to repeat the sampling
	my_count <- 0
	for(k in 1:n_trials){  # Repeat sampling n_trials times
		my_results <- sample(population, sample_size, replace=TRUE)  # Get the samples
		if(FALSE %in% my_results){  # Look for whether it had species B
			my_count <- my_count + 1  # Add one to the count if it did
		}
	}
	return(my_count/n_trials)  # Return the proportion of trials detecting species B
}

create_pop <- function(n,N){  # Make the populations
	return(append(rep(TRUE,N-n),rep(FALSE,n)))  # Populations have N-n repetitions of TRUE (sp. A), n reps of FALSE (sp. B)
}

mypops <- lapply(x, create_pop, pop_size)  # Create populations for sampling, one for each possible count of B

# Apply the sampling function to the populations, recording the proportion of trials sampling at least one of species B
my_percentages <- sapply(mypops, sample_from_pop, sample_size=sample_n, n_trials=n_trials)

My simulation results showed that with 22 or more birds of species B (non-fulmar prions), there was a >95% chance that at least one of them would be detected. In other words, from my photographic data, there is a 95% probability that the flock of 40 prions contained no fewer than 19 fulmar prions.
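
This particular case is simple enough to check against an exact answer: the chance of missing species B entirely in sample_n draws with replacement is ((pop_size - B)/pop_size)^sample_n, so the detection probability is one minus that. Two quick lines in R confirm the simulated threshold:

1 - ((pop_size - 22)/pop_size)^sample_n  # 0.959: just over 95% with 22 of species B
1 - ((pop_size - 21)/pop_size)^sample_n  # 0.949: just under 95% with 21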

Let’s take a look at it graphically.

library(ggplot2)

mydata <- data.frame(my_percentages, 0:pop_size)  # Make a data.frame with the results and the # of species B
names(mydata) <- c("DetProb", "B")  # Rename the columns to something friendly and vaguely descriptive

p <- ggplot(mydata, aes(x=B,y=DetProb)) + geom_point()  # Create the basic ggplot2 scatterplot
p <- p + geom_hline(yintercept=0.95)  # Add a horizontal line at 95%
p <- p + theme_bw() + labs(x="# of species B (pop. 40)", y="Detection probability of B")  # Tidy up the presentation and labeling
print(p)  # Display it!
Results of the Monte Carlo simulation. At left is all A, while at right is a population with all B. The horizontal line is the 95% probability line. Points above the line have a >95% chance of detecting species B.
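
To pull the threshold out of the simulation results directly, rather than reading it off the plot, a couple of lines like these suffice (recall that my_percentages[k] corresponds to k-1 birds of species B):

min_B <- min(which(my_percentages >= 0.95)) - 1  # Smallest count of B with a >=95% detection chance
min_B  # 22 in my run
pop_size - (min_B - 1)  # 21 of B can still be missed >5% of the time, leaving at least 19 fulmar prions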

With 22 or more non-fulmar prions, there’s a >95% chance one would be photographed. With 19 fulmar prions and 21 non-fulmar prions, there’s a >5% chance the non-fulmar prions would be missed. So our minimum number of fulmar prions is 19. I may have seen a flock of 40 fulmar prions, but there aren’t enough observations to say with statistical confidence that they were all fulmar prions.


Satellite Image Processing Revisited

Heard Island on Nov. 20, 2015, with image processing underway in QGIS. Image credit: Bill Mitchell (CC-BY) with satellite imagery from USGS (EO-1 satellite, ALI instrument).

Following up on my earlier post about satellite image processing, I am happy to report that I have made progress in being able to process images myself! Through a fortunate combination of search terms, timing, and luck, I managed to come across two key pieces of information that I needed.

First, I found out how to make RGB images from raster data layers, such as different spectral bands on a satellite, fairly easily with QGIS. That was a big step forward from how I had been doing it previously, which was inelegant, inefficient, and only mostly worked. Stacking three layers (one each for red, green, and blue) into a virtual raster catalog was just a few clicks away (Raster | Miscellaneous | Build Virtual Raster (Catalog)).

Encouraged by the success with that project, I continued clicking around and stumbled across a mention of pan-sharpening (also written pan sharpening), in which a high-resolution panchromatic (all-color) detector is used to sharpen a lower-resolution color image. Alternately, you can think of it the complementary way: lower-resolution color data is added to a high-resolution greyscale image. Thanks to this blog post, I was able to find out what I needed to do to make that happen in QGIS (and Orfeo Toolbox).
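
For the curious, here is a toy sketch in R of what pan-sharpening does (my own illustration of the general idea, not the algorithm Orfeo Toolbox actually uses). It is a Brovey-style transform, assuming r, g, b, and pan are matrices on the same grid, with the color bands already resampled up to the pan resolution:

brovey_sharpen <- function(r, g, b, pan){
	intensity <- (r + g + b) / 3  # Brightness estimate from the upsampled color bands
	scale <- pan / pmax(intensity, 1e-9)  # Per-pixel detail from the pan band (guarding against divide-by-zero)
	list(r = r * scale, g = g * scale, b = b * scale)  # Inject the pan detail into each band
}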

Of course, it would be too easy for that to work. I didn't have the Orfeo Toolbox installed, which the pan-sharpening tool requires, and I ended up having to compile it from source code.* When the compiler finished and the program was installed, I went to tell QGIS where it was, but a bug in QGIS prevented me from entering the folder location. Having just compiled and installed software, I first attempted to build the latest version of QGIS and many of the tools on which it relies. After failing to make all of those and some of the compiler configuration software play nicely with each other, I eventually remembered I could get updated packages through apt-get, which provides pre-compiled binaries put out by the maintainers of Debian Linux. That solution worked: I added the folder location, and now I can have my pan-sharpened images.

Here for your viewing pleasure is my first properly pan-sharpened image: Heard Island on Nov. 20, 2015, seen in “true color” by the Advanced Land Imager (ALI) on the EO-1 satellite.** I'm not convinced it's entirely right (the contrast probably needs to come down a bit), but I think it's close.

Heard Island in true color on Nov. 20, 2015. Image processing: Bill Mitchell (CC-BY) using data from USGS (EO-1/ALI).

* Knowing how to compile software from source code is a rather handy skill.
** Emily Lakdawalla has written a great explanation of what “true color” means.

Addressing Uncertainty Numerically

Casino Monte Carlo, Monaco. Image credit: Positiv (CC-BY-SA).

Recently I wrote about what scientific uncertainty is, and why it might be important. There are a few things which can be done to understand the “fuzziness” of a number. However, the real world can get complicated in ways that most introductory courses don't cover.

For instance, a standard introduction would walk you through what to do if you are adding or multiplying two numbers. In the former case, you add the uncertainties together, and in the latter you take the root sum of squares of the relative uncertainties. This can be a little bit of math, but in general it's fairly straightforward: plug in the numbers and out pops an answer.
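
For concreteness, here are those two rules in R, using made-up values x = 10.0 ± 0.3 and y = 4.0 ± 0.2:

x <- 10.0; dx <- 0.3  # Made-up measurement and its uncertainty
y <- 4.0;  dy <- 0.2  # Likewise

d_sum <- dx + dy  # Sum rule: x + y = 14.0 +/- 0.5
xy <- x * y
d_xy <- xy * sqrt((dx/x)^2 + (dy/y)^2)  # Product rule: x*y = 40.0 +/- ~2.3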

Sometimes, though, figuring out the proper formula can be time-consuming or nearly impossible. In these cases, there is still a way to get a satisfactory answer: a Monte Carlo simulation, so named for the iconic casino in Monaco (pictured above).

The Monte Carlo simulation is effectively an exercise in rolling dice. Rather than handle the uncertainties analytically, random numbers are used (typically a Gaussian distribution of appropriate mean and width) and combined together in the equation in question, with no uncertainties attached. After a large number of repetitions (for my purposes around 10^4–10^7), the statistical distribution can be evaluated. In many cases that evaluation means taking a mean and standard deviation, although it is very informative to look at the distribution to be sure it is at least roughly Gaussian. If the distribution is bimodal or highly skewed, a standard deviation may not be an appropriate metric.
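
As a sketch of that procedure, here is the made-up product from the example above, done the Monte Carlo way:

n_mc <- 100000  # Set the number of Monte Carlo draws
xs <- rnorm(n_mc, mean=10.0, sd=0.3)  # Draw x with its uncertainty
ys <- rnorm(n_mc, mean=4.0, sd=0.2)  # Draw y with its uncertainty
products <- xs * ys  # Combine them with no uncertainties attached
mean(products)  # ~40.0
sd(products)  # ~2.3, matching the analytic product rule
hist(products)  # Check that the distribution is roughly Gaussian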

There is another place where Monte Carlo simulations are useful. Suppose you want to know the average (and distribution) produced when you roll four 6-sided dice, and sum the three highest. Is there a way to solve it analytically? Probably, but I can code (and run) the Monte Carlo simulation much more quickly.*

Here’s how to do that in the statistics program R:

my_repetitions <- 10000; # Set the number of trials
# Use a variable for this so when you change N, you only have to do so once.
# Tracking down every place you entered 10000 by hand is no fun.

my_roll <- function(){ # Declare a function to roll four dice and sum the highest three
	# returns: sum of highest three of four six-sided dice rolls (integer)
	roll_four <- sample(seq(1, 6), size=4, replace=TRUE); # Create a four-element vector representing four 6-sided dice
	return(sum(roll_four) - min(roll_four)) # Sum the four dice, and subtract the minimum roll
}

my_results <- replicate(my_repetitions, my_roll()); # Create a vector of dimension my_repetitions, with each element the result of my_roll()

summary(my_results); # Show a statistical summary of the results
hist(my_results); # Plot a quick-and-dirty histogram of the results

Monte Carlo results (N=10,000) for rolling four 6-sided dice and summing the highest three.

Monte Carlo simulations are very handy when there are non-linear interactions between parameters with uncertainties. There may be no analytical solution, but creating a Monte Carlo procedure for determining the final distribution is relatively straightforward. In fact, this is what many weather models and climate models do. An ensemble of different numerical predictions will be put together, and the result gives a statistical likelihood of different events. With weather models, the trick is to keep the code running quickly enough that when it’s done the result is still a prediction—predicting what the weather would have been like a week ago is not useful in the way that knowing what tomorrow may have in store is.

This post is not meant to be an introduction to R (which I used), or to good programming practices (which I hope I used). Many of these are available online, or at colleges and universities in facilities like UC Berkeley’s D-Lab. I may also write about these topics at a later point.

* For those keeping score at home, the mean of this dice-rolling scheme is 12.25 and the median is 12.00, with the middle 50% ranging from 10 to 14.
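
In fact, the 6^4 = 1296 possible outcomes are few enough to enumerate exactly in R, which confirms those figures:

rolls <- expand.grid(d1=1:6, d2=1:6, d3=1:6, d4=1:6)  # All 1296 possible rolls of four dice
sums <- apply(rolls, 1, function(r) sum(r) - min(r))  # Highest three of each roll
mean(sums)  # 12.2446, which rounds to the simulated 12.25
quantile(sums, c(0.25, 0.5, 0.75))  # 10, 12, 14: the middle 50% and the median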