<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: The thrill of the chase</title>
	<atom:link href="http://www.bit-player.org/2010/the-thrill-of-the-chase/feed" rel="self" type="application/rss+xml" />
	<link>http://bit-player.org/2010/the-thrill-of-the-chase</link>
	<description>An amateur's outlook on computation and mathematics.</description>
	<pubDate>Wed, 08 Feb 2012 13:17:40 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.3</generator>
		<item>
		<title>By: Mark Myatt</title>
		<link>http://bit-player.org/2010/the-thrill-of-the-chase#comment-3092</link>
		<dc:creator>Mark Myatt</dc:creator>
		<pubDate>Mon, 26 Jul 2010 14:18:40 +0000</pubDate>
		<guid isPermaLink="false">http://bit-player.org/?p=706#comment-3092</guid>
		<description>I've always known this as the "Petersen Method" and used it (or derivatives) for epidemiological problems such as assessing the completeness of notifiable disease reporting systems, the exhaustivity of case-finding methods, and estimating the numbers of commercial sex workers in towns and cities. It has a known bias towards overestimation and the Seber Estimator:

    [(E1 + 1)(E2 + 1) / (S + 1)]  - 1

is preferred.
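
A minimal sketch of the two estimators in Python, assuming E1 and E2 are the counts found by each reader and S is the count found by both. The function names and the example counts are illustrative only:

```python
def petersen(e1, e2, s):
    """Plain Lincoln-Petersen estimate E1*E2/S; undefined when S is 0."""
    if s == 0:
        raise ValueError("no overlap: the Petersen estimate is undefined")
    return e1 * e2 / s


def seber(e1, e2, s):
    """Bias-adjusted estimate [(E1 + 1)(E2 + 1) / (S + 1)] - 1."""
    return (e1 + 1) * (e2 + 1) / (s + 1) - 1


# Made-up counts: 20 and 15 errors found, 10 of them found by both readers.
print(petersen(20, 15, 10))  # 30.0
print(seber(20, 15, 10))     # about 29.55
```

With these counts the adjusted estimate comes out slightly below the plain ratio, and it stays defined even when S is 0.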

The simplicity of the arithmetic hides the complications of doing data collection well. The Petersen and Seber estimators assume (1) that the population is closed (e.g. the number of errors is constant) - this can also mean that the population is both closed and "well defined" (e.g. the two proof-readers have the same version of the text), (2) all errors are equally catchable by both proof-readers, (3) detected errors can be matched, and (4) detection of an error by one proof-reader is not influenced by the detection of the same error by the other proof-reader. This is, I think, more stringent than the two simple random samples mentioned above (that would cover (2) and (4) only). It may be easy (I have my doubts) to meet these assumptions in the context of proof-reading but is very difficult in epidemiological applications. In my experience, capture-recapture studies are often flawed (all of mine have been!). It is then a matter of identifying violations of assumptions and their likely effect in terms of the magnitude and direction on the final estimate.

There are several methods that use multiple testers / multiple lists. John might be interested in:

Schnabel, Z. E. (1938), "The Estimation of the Total Fish Population of a Lake", American Mathematical Monthly, 45, 348–352.

which proposes an extension to the Petersen method based on cumulative marking with a single mark (i.e. no need for multiple marking or different marks for different testers).
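
As a rough sketch of how that works, assuming the usual textbook form of the Schnabel estimator, sum(C_t * M_t) / sum(R_t), where C_t is the size of sample t, R_t the number in it already marked, and M_t the number marked before sample t. The function name and the counts are made up for illustration:

```python
def schnabel(catches, recaptures):
    """Schnabel-style estimate: sum(C_t * M_t) / sum(R_t).

    catches[t]    -- C_t, items caught in sample t
    recaptures[t] -- R_t, items in sample t that were already marked
    M_t is accumulated here on the assumption that every newly
    caught item receives the single mark before release.
    """
    marked = 0            # M_t: marked items at large before this sample
    weighted = 0          # running sum of C_t * M_t
    total_recaptured = 0  # running sum of R_t
    for caught, recaptured in zip(catches, recaptures):
        weighted += caught * marked
        total_recaptured += recaptured
        marked += caught - recaptured  # only new items get a mark
    if total_recaptured == 0:
        raise ValueError("no recaptures: the estimate is undefined")
    return weighted / total_recaptured


# With two samples this reduces to the Petersen ratio E1*E2/S.
print(schnabel([20, 15], [0, 10]))  # 30.0
```

A third or fourth tester just adds another pair of counts to the two lists.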

Good summaries of abundance estimators for capture-recapture data can be found in many textbooks. Krebs' "Ecological Methodology" is well regarded.</description>
		<content:encoded><![CDATA[<p>I&#8217;ve always known this as the &#8220;Petersen Method&#8221; and used it (or derivatives) for epidemiological problems such as assessing the completeness of notifiable disease reporting systems, the exhaustivity of case-finding methods, and estimating the numbers of commercial sex workers in towns and cities. It has a known bias towards overestimation and the Seber Estimator:</p>
<p>    [(E1 + 1)(E2 + 1) / (S + 1)]  - 1</p>
<p>is preferred.</p>
<p>The simplicity of the arithmetic hides the complications of doing data collection well. The Petersen and Seber estimators assume (1) that the population is closed (e.g. the number of errors is constant) - this can also mean that the population is both closed and &#8220;well defined&#8221; (e.g. the two proof-readers have the same version of the text), (2) all errors are equally catchable by both proof-readers, (3) detected errors can be matched, and (4) detection of an error by one proof-reader is not influenced by the detection of the same error by the other proof-reader. This is, I think, more stringent than the two simple random samples mentioned above (that would cover (2) and (4) only). It may be easy (I have my doubts) to meet these assumptions in the context of proof-reading but is very difficult in epidemiological applications. In my experience, capture-recapture studies are often flawed (all of mine have been!). It is then a matter of identifying violations of assumptions and their likely effect in terms of the magnitude and direction on the final estimate.</p>
<p>There are several methods that use multiple testers / multiple lists. John might be interested in:</p>
<p>Schnabel, Z. E. (1938), &#8220;The Estimation of the Total Fish Population of a Lake&#8221;, American Mathematical Monthly, 45, 348–352.</p>
<p>which proposes an extension to the Petersen method based on cumulative marking with a single mark (i.e. no need for multiple marking or different marks for different testers).</p>
<p>Good summaries of abundance estimators for capture-recapture data can be found in many textbooks. Krebs&#8217; &#8220;Ecological Methodology&#8221; is well regarded.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sachin</title>
		<link>http://bit-player.org/2010/the-thrill-of-the-chase#comment-3073</link>
		<dc:creator>Sachin</dc:creator>
		<pubDate>Fri, 23 Jul 2010 07:11:53 +0000</pubDate>
		<guid isPermaLink="false">http://bit-player.org/?p=706#comment-3073</guid>
		<description>Hi, 
How can we find the expected number of defects when there are more than two testers?

Regards,
Sachin.</description>
		<content:encoded><![CDATA[<p>Hi,<br />
How can we find the expected number of defects when there are more than two testers?</p>
<p>Regards,<br />
Sachin.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Cody</title>
		<link>http://bit-player.org/2010/the-thrill-of-the-chase#comment-3043</link>
		<dc:creator>Cody</dc:creator>
		<pubDate>Tue, 13 Jul 2010 15:30:17 +0000</pubDate>
		<guid isPermaLink="false">http://bit-player.org/?p=706#comment-3043</guid>
		<description>I was first struck by the ease of finding things out on the web a few years ago, while working on a physics assignment in college with my apartment-mate. We were sitting in my room, me at my computer, when he asked what the half life of a particular isotope of molybdenum was, and I entered it directly into a url prefixed with "en.wikipedia.org/wiki/".

Our conversation then turned to how I couldn't have found the answer faster had he handed me a book and told me what page to look on, let alone if we had needed to go to a library or arrange for a reference to be sent from another library!</description>
		<content:encoded><![CDATA[<p>I was first struck by the ease of finding things out on the web a few years ago, while working on a physics assignment in college with my apartment-mate. We were sitting in my room, me at my computer, when he asked what the half life of a particular isotope of molybdenum was, and I entered it directly into a url prefixed with &#8220;en.wikipedia.org/wiki/&#8221;.</p>
<p>Our conversation then turned to how I couldn&#8217;t have found the answer faster had he handed me a book and told me what page to look on, let alone if we had needed to go to a library or arrange for a reference to be sent from another library!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joel Reyes Noche</title>
		<link>http://bit-player.org/2010/the-thrill-of-the-chase#comment-3042</link>
		<dc:creator>Joel Reyes Noche</dc:creator>
		<pubDate>Tue, 13 Jul 2010 00:49:09 +0000</pubDate>
		<guid isPermaLink="false">http://bit-player.org/?p=706#comment-3042</guid>
		<description>When John D. Barrow looks at the number of *unfound* errors, he uses the formula E1E2/S for the total number of errors.  Here's a quotation from p. 83 of his book "Impossibility" (Oxford: Oxford University Press, 1998):

- - - - -
Suppose that two editors, Jack and Jill, independently read a long newspaper article supplied by one of their journalists.  Jack finds A typing errors, whilst Jill finds B typing errors.  They compare copies and discover that they both found the same error on C occasions.  How many errors do you expect to remain, unfound, in the article?

Let us suppose that the total number of errors in the article is E.  This means that the number that have yet to be found equals E-A-B+C.  The last factor of +C is so that we don't double count the errors that Jack and Jill both found.  Now, if the probability that Jack spots an error is p, and the probability that Jill spots an error is q, then we expect that A=pE, B=qE, and C=pqE, because they search independently.  So, AB=pqE x E; hence AB=CE.  Now we have the answer:  the number of unfound errors equals E-A-B+C=AB/C - A - B + C, where we have replaced the unknown quantity, E, by AB/C.  rearranging our formula, we have shown that the number of unfound errors is equal to (A-C)(B-C)/C; that is,

Number of unfound errors=(Number found only by Jack) x (Number found only by Jill)/(Number found by both Jack and Jill)
- - - - -</description>
		<content:encoded><![CDATA[<p>When John D. Barrow looks at the number of *unfound* errors, he uses the formula E1E2/S for the total number of errors.  Here&#8217;s a quotation from p. 83 of his book &#8220;Impossibility&#8221; (Oxford: Oxford University Press, 1998):</p>
<p>- - - - -<br />
Suppose that two editors, Jack and Jill, independently read a long newspaper article supplied by one of their journalists.  Jack finds A typing errors, whilst Jill finds B typing errors.  They compare copies and discover that they both found the same error on C occasions.  How many errors do you expect to remain, unfound, in the article?</p>
<p>Let us suppose that the total number of errors in the article is E.  This means that the number that have yet to be found equals E-A-B+C.  The last factor of +C is so that we don&#8217;t double count the errors that Jack and Jill both found.  Now, if the probability that Jack spots an error is p, and the probability that Jill spots an error is q, then we expect that A=pE, B=qE, and C=pqE, because they search independently.  So, AB=pqE x E; hence AB=CE.  Now we have the answer:  the number of unfound errors equals E-A-B+C=AB/C - A - B + C, where we have replaced the unknown quantity, E, by AB/C.  rearranging our formula, we have shown that the number of unfound errors is equal to (A-C)(B-C)/C; that is,</p>
<p>Number of unfound errors=(Number found only by Jack) x (Number found only by Jill)/(Number found by both Jack and Jill)<br />
- - - - -</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John</title>
		<link>http://bit-player.org/2010/the-thrill-of-the-chase#comment-3041</link>
		<dc:creator>John</dc:creator>
		<pubDate>Mon, 12 Jul 2010 21:09:24 +0000</pubDate>
		<guid isPermaLink="false">http://bit-player.org/?p=706#comment-3041</guid>
		<description>(But of course you should add a new mark to the fish every time you release it.)</description>
		<content:encoded><![CDATA[<p>(But of course you should add a new mark to the fish every time you release it.)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John</title>
		<link>http://bit-player.org/2010/the-thrill-of-the-chase#comment-3040</link>
		<dc:creator>John</dc:creator>
		<pubDate>Mon, 12 Jul 2010 21:08:33 +0000</pubDate>
		<guid isPermaLink="false">http://bit-player.org/?p=706#comment-3040</guid>
		<description>Okay, but what about three or more samples?  I suppose the limit is catch-and-release fishing, when you could unluckily catch the same fish over and over again.</description>
		<content:encoded><![CDATA[<p>Okay, but what about three or more samples?  I suppose the limit is catch-and-release fishing, when you could unluckily catch the same fish over and over again.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: brian</title>
		<link>http://bit-player.org/2010/the-thrill-of-the-chase#comment-3039</link>
		<dc:creator>brian</dc:creator>
		<pubDate>Mon, 12 Jul 2010 19:53:06 +0000</pubDate>
		<guid isPermaLink="false">http://bit-player.org/?p=706#comment-3039</guid>
		<description>Phil Brass wrote:

&lt;blockquote&gt;I wonder how well this kind of thing could apply to software quality results&lt;/blockquote&gt;

That's the question raised in the blog posting by Mat Roberts that launched all this.

&lt;blockquote&gt;or adolescent European Plaice are susceptible to conformal peer pressure…&lt;/blockquote&gt;

Of course! Piercings. How could I have missed the obvious?</description>
		<content:encoded><![CDATA[<p>Phil Brass wrote:</p>
<blockquote><p>I wonder how well this kind of thing could apply to software quality results</p></blockquote>
<p>That&#8217;s the question raised in the blog posting by Mat Roberts that launched all this.</p>
<blockquote><p>or adolescent European Plaice are susceptible to conformal peer pressure…</p></blockquote>
<p>Of course! Piercings. How could I have missed the obvious?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: brian</title>
		<link>http://bit-player.org/2010/the-thrill-of-the-chase#comment-3037</link>
		<dc:creator>brian</dc:creator>
		<pubDate>Mon, 12 Jul 2010 12:11:52 +0000</pubDate>
		<guid isPermaLink="false">http://bit-player.org/?p=706#comment-3037</guid>
		<description>@Greg Wilson: Thanks for catching that. It was &lt;em&gt;not&lt;/em&gt; a planted test for proofreaders.

@Greg: If both readers find all the errors, then E1=E2=S, and also E1+E2-S = E1E2/S. But the situation we're interested in is one where we &lt;em&gt;don't know&lt;/em&gt; how many errors really exist; all we know is how many errors the two readers found, and how many of those were found by both. Your formula assumes that there are no other errors. The Lincoln/Petersen/Laplace formula assumes that both readers have gathered a random sample of errors from the unknown population. By the way, a corner case is when the two error sets have no overlap at all, i.e., S = 0. One response to this situation is to say that with no overlap we have no upper bound on the total population. But it's also possible to take a less dire view, and use a formula along the lines of:
 (E1+1)(E2+1)/(S+1) - 1.
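
A quick check of both points in Python, with made-up counts chosen only for illustration:

```python
# Both readers find every one of 12 errors: the two formulas agree.
e1, e2, s = 12, 12, 12
print(e1 + e2 - s, e1 * e2 / s)  # 12 12.0

# No overlap at all: E1*E2/S would divide by zero,
# but the adjusted formula still returns a finite number.
e1, e2, s = 8, 5, 0
adjusted = (e1 + 1) * (e2 + 1) / (s + 1) - 1
print(adjusted)  # 53.0
```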

@John Cowan: The whole world is counting on &lt;em&gt;you&lt;/em&gt; to fill in that last missing piece of the community memory bank!</description>
		<content:encoded><![CDATA[<p>@Greg Wilson: Thanks for catching that. It was <em>not</em> a planted test for proofreaders.</p>
<p>@Greg: If both readers find all the errors, then E1=E2=S, and also E1+E2-S = E1E2/S. But the situation we&#8217;re interested in is one where we <em>don&#8217;t know</em> how many errors really exist; all we know is how many errors the two readers found, and how many of those were found by both. Your formula assumes that there are no other errors. The Lincoln/Petersen/Laplace formula assumes that both readers have gathered a random sample of errors from the unknown population. By the way, a corner case is when the two error sets have no overlap at all, i.e., S = 0. One response to this situation is to say that with no overlap we have no upper bound on the total population. But it&#8217;s also possible to take a less dire view, and use a formula along the lines of:<br />
 (E1+1)(E2+1)/(S+1) - 1.</p>
<p>@John Cowan: The whole world is counting on <em>you</em> to fill in that last missing piece of the community memory bank!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Phil Brass</title>
		<link>http://bit-player.org/2010/the-thrill-of-the-chase#comment-3036</link>
		<dc:creator>Phil Brass</dc:creator>
		<pubDate>Mon, 12 Jul 2010 11:53:17 +0000</pubDate>
		<guid isPermaLink="false">http://bit-player.org/?p=706#comment-3036</guid>
		<description>Interesting article.  I wonder how well this kind of thing could apply to software quality results (and in particular for my interests, security vulnerabilities).  I.e. if Lincoln finds E1 defects and Petersen finds E2 defects, with S in common, can we safely assume that there are E1E2/S defects?

I don't think so, because testing for vulnerabilities is generally not a simple random sampling mechanism.  Much depends on the method used for testing - static analysis versus fuzzing versus manual inspection or grey-box testing for example.  

Certain testing methods will typically only find certain classes of vulnerabilities for example, and with varying degrees of effectiveness.  Fuzzing (pseudo-random input with varying degrees of structure) might be very good at uncovering buffer overflows and other parsing or data-plane/semantic-boundary transition problems, but it typically doesn't do as well at identifying privilege or permission bypass issues.

On the other hand, if you had two different teams fuzzing your firewall for example, and they came up with different results with some subset of identical results, it might make sense to apply the Lincoln index and estimate the total number of defects that could eventually be discovered by fuzzing, since fuzzing is essentially a random selection process.  

The other interesting thing about fuzzing is that it tends to be subject to diminishing returns, so you might also be able to estimate the total "findable" number from the decline in the rate of identification.  So perhaps the most useful upper bound would be to take E1 and E2 to be the rate-adjusted total-findable estimates (instead of the currently identified count) for the two different teams.

Another case where the selection process might not have been random is Petersen's fish experiment.  If 1/7th of a fish population had holes punched in their dorsal fins, and 1/5th of later catches were found with punched dorsal fins, it kind of suggests that punching holes in dorsal fins might make fish easier to catch.  Either that, or adolescent European Plaice are susceptible to conformal peer pressure...</description>
		<content:encoded><![CDATA[<p>Interesting article.  I wonder how well this kind of thing could apply to software quality results (and in particular for my interests, security vulnerabilities).  I.e. if Lincoln finds E1 defects and Petersen finds E2 defects, with S in common, can we safely assume that there are E1E2/S defects?</p>
<p>I don&#8217;t think so, because testing for vulnerabilities is generally not a simple random sampling mechanism.  Much depends on the method used for testing - static analysis versus fuzzing versus manual inspection or grey-box testing for example.  </p>
<p>Certain testing methods will typically only find certain classes of vulnerabilities for example, and with varying degrees of effectiveness.  Fuzzing (pseudo-random input with varying degrees of structure) might be very good at uncovering buffer overflows and other parsing or data-plane/semantic-boundary transition problems, but it typically doesn&#8217;t do as well at identifying privilege or permission bypass issues.</p>
<p>On the other hand, if you had two different teams fuzzing your firewall for example, and they came up with different results with some subset of identical results, it might make sense to apply the Lincoln index and estimate the total number of defects that could eventually be discovered by fuzzing, since fuzzing is essentially a random selection process.  </p>
<p>The other interesting thing about fuzzing is that it tends to be subject to diminishing returns, so you might also be able to estimate the total &#8220;findable&#8221; number from the decline in the rate of identification.  So perhaps the most useful upper bound would be to take E1 and E2 to be the rate-adjusted total-findable estimates (instead of the currently identified count) for the two different teams.</p>
<p>Another case where the selection process might not have been random is Petersen&#8217;s fish experiment.  If 1/7th of a fish population had holes punched in their dorsal fins, and 1/5th of later catches were found with punched dorsal fins, it kind of suggests that punching holes in dorsal fins might make fish easier to catch.  Either that, or adolescent European Plaice are susceptible to conformal peer pressure&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Greg Wilson</title>
		<link>http://bit-player.org/2010/the-thrill-of-the-chase#comment-3035</link>
		<dc:creator>Greg Wilson</dc:creator>
		<pubDate>Mon, 12 Jul 2010 10:05:24 +0000</pubDate>
		<guid isPermaLink="false">http://bit-player.org/?p=706#comment-3035</guid>
		<description>Ironically, there's a typo in your article's penultimate paragraph: the sentence "I’d say he has a good claim, except that Pierre Simon de Laplace more than a century earlier." is incomplete.
Thanks for another entertaining post!</description>
		<content:encoded><![CDATA[<p>Ironically, there&#8217;s a typo in your article&#8217;s penultimate paragraph: the sentence &#8220;I’d say he has a good claim, except that Pierre Simon de Laplace more than a century earlier.&#8221; is incomplete.<br />
Thanks for another entertaining post!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Greg</title>
		<link>http://bit-player.org/2010/the-thrill-of-the-chase#comment-3034</link>
		<dc:creator>Greg</dc:creator>
		<pubDate>Mon, 12 Jul 2010 05:14:43 +0000</pubDate>
		<guid isPermaLink="false">http://bit-player.org/?p=706#comment-3034</guid>
		<description>I don't know about the population estimate, but the point you're trying to make at the end about this being a simple calculation that obviously recovers the total number of errors is wrong. I think you must be thinking of the inclusion/exclusion principle. Assuming that the two copyreaders find all the errors between them, the total errors will be exactly E1+E2-S. There's no reason for multiplication or division to come into this.</description>
		<content:encoded><![CDATA[<p>I don&#8217;t know about the population estimate, but the point you&#8217;re trying to make at the end about this being a simple calculation that obviously recovers the total number of errors is wrong. I think you must be thinking of the inclusion/exclusion principle. Assuming that the two copyreaders find all the errors between them, the total errors will be exactly E1+E2-S. There&#8217;s no reason for multiplication or division to come into this.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John Cowan</title>
		<link>http://bit-player.org/2010/the-thrill-of-the-chase#comment-3033</link>
		<dc:creator>John Cowan</dc:creator>
		<pubDate>Mon, 12 Jul 2010 04:44:16 +0000</pubDate>
		<guid isPermaLink="false">http://bit-player.org/?p=706#comment-3033</guid>
		<description>Unfortunately yes.  I was trying to recover the lyrics of a song I learned about the U.S. Presidents as a kid, to the tune of "Yankee Doodle".  There are a few very different versions online, but the one that I recognize goes only up through Grover Cleveland, yet I remember it through Eisenhower.  The rest is apparently lost, at least until someone bothers to put it online.</description>
		<content:encoded><![CDATA[<p>Unfortunately yes.  I was trying to recover the lyrics of a song I learned about the U.S. Presidents as a kid, to the tune of &#8220;Yankee Doodle&#8221;.  There are a few very different versions online, but the one that I recognize goes only up through Grover Cleveland, yet I remember it through Eisenhower.  The rest is apparently lost, at least until someone bothers to put it online.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

