The SoS has released the latest random sample report for Tim Draper’s initiative to divide the state into six Californias.
Calaveras, Humboldt, Kings, Modoc, Mono, Nevada, and Ventura counties have turned in their raw counts, bringing Tim Draper’s total to 1,083,353 raw signatures (it was 1,038,836 in my first report). That lowers the validity rate he needs to qualify to 74.5% (was 77.7%) and the rate he needs to avoid a full count to 82.0% (was 85.5%). Below 70.8% (was 73.9%) he doesn’t even get a full count. We’re still waiting for Alameda, Alpine, Amador, Inyo, and Trinity counties to report their raw numbers. If they bring the raw total up to the 1.3 million claimed, he needs only 62.1% to qualify and 68.3% to avoid a full count.
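The arithmetic behind those rates is simple to check. Here is a minimal Python sketch, assuming the widely reported 807,615 valid-signature requirement for this measure and the Elections Code rule that a random sample projecting at least 110% of the requirement qualifies the measure outright, while a projection below 95% disqualifies it without a full count; it reproduces the percentages above:

```python
# Sketch of the qualification arithmetic. ASSUMPTIONS: 807,615 valid
# signatures required; the sample must project >= 110% of that to qualify
# outright, and a projection below 95% fails without a full count.
REQUIRED = 807_615

def rates(raw_count):
    return (REQUIRED / raw_count,         # validity rate needed to qualify at all
            1.10 * REQUIRED / raw_count,  # rate needed to skip the full count
            0.95 * REQUIRED / raw_count)  # below this, no full count: it just fails

for raw in (1_038_836, 1_083_353, 1_300_000):
    qualify, avoid_full, floor = rates(raw)
    print(f"raw {raw:>9,}: qualify {qualify:.1%}, "
          f"avoid full count {avoid_full:.1%}, full-count floor {floor:.1%}")
```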
Also, the following counties have completed their random samples (with validity rates as noted): Merced (66.7%), Modoc (65.4%), Mono (81.0%), Placer (72.5%), and San Joaquin (72.7%). The uncorrected validity rate is 71.8%, up from 70.7% in the first report. When one corrects for duplicates, the validity rate is 66.4%, up from 58.1%.
Speaking of correcting for duplicates, I think I’ve convinced myself that I now understand where the “-1” comes from in the correction factor for duplicate signatures. It’s best explained with an example.
Suppose I have 100 signatures, and I pick 25 of them (one fourth of 100) at random to check. Of the 25 signatures, I find that one person (Mary) isn’t registered to vote, and one person who is registered (John) has signed twice. That means I have 23 valid signatures and 2 invalid ones (Mary’s and one of John’s). The uncorrected validity rate, before the extra accounting for duplicates, is 92% (23/25).
Remember that these signatures were picked at random, so if I found two signatures from John in the 25 I picked, it’s likely that there are three others from John in the other 75. (Well, maybe not likely, but that’s the best estimate.) So John really accounts for 4 duplicate signatures, not just one. But we already accounted for one of those duplicates by calling it invalid in our sample, so we just have to account for the 3 extra duplicates in the unsampled portion.
Also, if John signed more than once in this sample of 25, we can suppose that there are probably three other people in the other 75 who also signed more than once, and the best estimate is that each of them, like John, signed five times (one of which counts as valid). So a factor of 4 (100/25) for the four people (John plus the estimated three others) who signed more than once, times 3 (4 – 1) because one duplicate from each signer is already accounted for by the uncorrected calculation, means John’s duplicate signature should be given a weight of 12. 12/100 is 12%, so the corrected validity rate is 92% – 12% = 80%.
Of course, if we found two people in the sample of 25 who signed twice, or if we found three signatures from John in that sample (one that we consider valid and two that we consider invalid), we’d have twice the correction factor (24%), etc.
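The same bookkeeping as a short Python sketch, using only the numbers from the example:

```python
# The Mary-and-John example: 25 of 100 signatures sampled, 23 valid,
# one unregistered signer, and one duplicate pair caught in the sample.
total, sample_size = 100, 25
valid_in_sample = 23
detected_pairs = 1            # John's pair, both halves in the sample

A = total / sample_size                            # 4: scale-up factor
uncorrected = valid_in_sample / sample_size        # 23/25 = 92%
correction = detected_pairs * A * (A - 1) / total  # 1 * 4 * 3 / 100 = 12%
print(f"{uncorrected:.0%} - {correction:.0%} = {uncorrected - correction:.0%}")
```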
Now before you start thinking “Gee, if I’m against a petition, I should sign it as many times as I can instead of not signing it at all, so as to drive up the duplicate rate, since duplicate signatures hurt more than plain invalid ones”, I have to point out that this is illegal. Elections Code section 18612 says “Every person is guilty of a misdemeanor who knowingly signs his or her own name more than once to any initiative, referendum, or recall petition ….” Deliberately signing a false name, while hurting the petition less than signing twice, carries a harsher penalty. Elections Code section 18613 says “Every person who subscribes to any initiative, referendum, or recall petition a fictitious name […] is guilty of a felony and is punishable by imprisonment pursuant to subdivision (h) of Section 1170 of the Penal Code for two, three, or four years.” So don’t do it.
–Steve Chessin
President, Californians for Electoral Reform (CfER)
www.cfer.org
The opinions expressed here are my own and not necessarily those of CfER.
The samples are too small to get reliable estimates of duplicates.
Let’s take the case of Solano County. If we compute valid_in_sample/sample_size * raw_count, we get 10,998* estimated valid. But based on the estimate of duplicates, there will be only 8,893 valid, meaning that there are an estimated 2,105 duplicates, which we double to get the number of signatures from double signers (4,210).
*In calculating the expected yield, I added back in the two duplicates that were found in the sample. I’ll (eventually) get around to explaining this.
So of the 16,221 signatures, it is estimated that:
6,788 valid and not duplicated.
4,210 apparently valid but duplicated (2,105 signers, two signatures each).
5,223 invalid.
Let’s assume that the estimates are correct, and imagine that the Solano election authorities did a full check, and prepared a card for each signature. They would stamp them with a “V” for valid; “I” for invalid; and “D” for duplicate. The duplicates would also include a serial number from 1…2105 indicating the signer, and an A or B, based on which was signed first.
We shuffle the cards and draw 500. We would expect 209 V cards, 161 I cards, and 130 D cards. Without the full count, we couldn’t tell that these 130 were duplicates, and most won’t have their mates in the draw, since each mate has a 96.9% chance of not being among the 500 drawn.
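A quick sketch of that arithmetic, taking the estimated card counts above as given:

```python
# Expected makeup of a 500-card draw from the 16,221-card deck,
# plus the chance that a drawn D card's mate was left behind.
RAW, SAMPLE = 16_221, 500
V, I, D = 6_788, 5_223, 4_210  # card counts from the estimates above

scale = SAMPLE / RAW
print(f"expected V: {V * scale:.0f}, I: {I * scale:.0f}, D: {D * scale:.0f}")
print(f"P(mate not drawn): {1 - (SAMPLE - 1) / (RAW - 1):.1%}")  # 96.9%
```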
We look at our new sample and count the number of duplicate pairs in it, that is, the cases where we hold both the DA and the DB card with the same serial number.
We expect two, but most samples would not have exactly that many. 12.8% of the time, there would be no detected duplicates at all. That is, even if the Solano signatures are as duplicate-laden as they appear to be, the sample could look as clean as San Joaquin’s.
Pairs found    Share of samples
0              12.8%
1              27.1%
2              27.9%
3              18.5%
4               8.9%
5               3.3%
6               1.0%
7+              0.3%
On **average**, we would find 2 duplicate pairs, but almost 3/4 of the time we would not find exactly 2. The estimate, however, takes whatever count the sample happens to produce at face value.
Also notice the long tail to the right of the distribution. Unless Solano is anomalous (signers there not remembering that they had already signed, or circulators being unscrupulous or careless), it is quite likely that the true duplicate rate is much lower and that Solano simply drew a sample in which both halves of two pairs turned up.
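That distribution is easy to reproduce by simulation. A minimal Monte Carlo sketch, taking the estimated 2,105 pairs as exactly right:

```python
# Simulate drawing 500 signatures from 16,221, of which 2,105 pairs
# (4,210 signatures) are duplicates, and tally how many pairs are
# caught whole in each sample.
import random
from collections import Counter

RAW, SAMPLE, PAIRS, TRIALS = 16_221, 500, 2_105, 100_000

tally = Counter()
for _ in range(TRIALS):
    drawn = random.sample(range(RAW), SAMPLE)
    # signatures 0..4209 form the pairs; signature i belongs to pair i // 2
    hits = Counter(i // 2 for i in drawn if i < 2 * PAIRS)
    tally[sum(1 for n in hits.values() if n == 2)] += 1

for k in sorted(tally):
    print(f"{k} pairs detected: {tally[k] / TRIALS:.1%}")
```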
If we assume smaller numbers of duplicates, we will still sometimes find two or more duplicate pairs in our sample. In the following table, the first column is the duplicate rate relative to that estimated for Solano, and the second is the percentage of samples that would contain two or more detected pairs.
Relative rate    Samples with 2+ pairs
100%             60.0%
 90%             54.2%
 80%             47.8%
 70%             40.9%
 60%             33.5%
 50%             26.0%
 40%             18.4%
 30%             11.3%
 20%              5.3%
 10%              1.1%
So with a true duplicate rate half of that estimated for Solano, 26% of samples would still contain 2 or more detected pairs, and we would estimate a duplication rate at least double the truth. And if the true rate is only one-fifth of the estimate, we would still find 2 or more pairs 5.3% of the time, producing an estimate at least five times the truth.
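A back-of-the-envelope way to approximate that table is to treat each pair independently and use a Poisson model; it only roughly tracks the figures above (it drifts high at the low end), but it shows the shape:

```python
# Poisson approximation: with d duplicate pairs among 16,221 signatures,
# the expected number caught whole in a 500-signature sample is
# d * (500/16221) * (499/16220); P(2 or more) follows from Poisson(lam).
# This is only an approximation and differs from the table at the low end.
from math import exp

RAW, SAMPLE, SOLANO_PAIRS = 16_221, 500, 2_105
p_pair = (SAMPLE / RAW) * ((SAMPLE - 1) / (RAW - 1))  # both halves drawn

for pct in range(100, 0, -10):
    lam = pct / 100 * SOLANO_PAIRS * p_pair
    p2plus = 1 - exp(-lam) * (1 + lam)  # P(X >= 2) for Poisson(lam)
    print(f"{pct:>3}% of Solano's rate: {p2plus:.1%} of samples show 2+ pairs")
```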
It’s like asking someone how far it is to some city and getting the reply, “umm, about 120 miles”. You suggest that they don’t sound too sure, and they reply that they are 99% sure it is between 10 and 300 miles.
Here is the mathematics behind the estimate of duplicates. The worksheet shows that A = raw_count/sample_size.
For Solano County, A = 16221 / 500 = 32.44. Put another way, we could construct 32.44 non-overlapping samples of 500 from the raw signatures; equivalently, the probability of any given signature being selected for our sample is 1/A.
Let S be the number of singleton valid signatures, D the number of duplicate signers (each contributing one first and one second signature, so D of each), and I the number of invalid signatures. Then:
S + D + D + I = T (total signatures, or raw count).
We want to estimate:
S + D (since the first signature from each duplicate signer is valid).
The expected number of duplicate signers’ first signatures in our sample is D/A, and the probability that the matching second signature was also drawn is 1/A. So D/(A*A) is the expected number of detected duplicate pairs in our sample:
X = D/(A*A)
or
D = X * (A*A)
The number of apparently valid signatures in our sample is:
S/A + D/A + D/A – X
Most of the duplicates appear valid because we don’t have the matching signature in our sample, except for the X we did find.
Multiplying that sample count by A gives V, the estimated number of valid signatures, but it (mostly) double counts the duplicate signers:
V = S + D + (D – X*A)
We need to remove D – X*A
But D = X * (A*A)
So we need to remove:
X * (A*A) – X*A
Factoring out the X, and one of the A’s:
X * A * (A-1).
The worksheet shows B = A * (A-1)
So the final estimate of valid signatures is:
V – X * B
The worksheet shows C = X * B. This is not quite the total number of duplicates, since it doesn’t count the ones that were detected in the sample.
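Plugging in the Solano numbers as a sketch (the sample counts here are inferred from the 10,998 and 8,893 figures above, not read off the worksheet itself):

```python
# Worksheet arithmetic for Solano. INFERRED sample counts: 339 signatures
# looked valid, 2 of them being second halves of detected duplicate pairs.
raw_count, sample_size = 16_221, 500
X = 2                       # duplicate pairs detected in the sample
valid_in_sample = 339 - X   # detected second signatures are marked invalid

A = raw_count / sample_size  # 32.44
B = A * (A - 1)              # the worksheet's B
V = valid_in_sample * A      # ~10,933 scaled-up valid (duplicates double counted)
C = X * B                    # the worksheet's C: ~2,040 undetected duplicates

print(f"A = {A:.2f}, B = {B:.1f}, C = {C:,.0f}")
print(f"final estimate V - X*B = {V - C:,.0f}")           # ~8,893
print(f"duplicate signers D = X*A*A = {X * A * A:,.0f}")  # ~2,105
```

This also squares with the footnote above: adding the two found duplicates back gives 339 * A, or about 10,998, and subtracting all D (about 2,105) estimated duplicate signers again lands on 8,893.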