The second issue is we can do better. I am happy we now live in a time when programmers write their own tests, but we do not hold ourselves to the same standards with tests as we do with production code. That is a reasonable trade-off; there are only so many hours in the day. But what LLMs lack in arithmetical prowess, they make up for in enthusiasm.
Let’s ask for an even better test.
In the tests, implement the simplest, most readable version of the standard code for quartiles over a fixed set of known values in a slice. Then pass the test cases through the standard code and the reservoir sampler and confirm they are within an epsilon of each other. Structure the comparison code such that it can be used in a fuzz test, too.
This got us some new test code:
// referenceQuartiles calculates the exact quartiles for a slice of float64 values
// using linear interpolation, matching the behavior expected from the sampler.
func referenceQuartiles(data []float64) (q1, median, q3 float64) { … }
// compareQuartiles checks if two sets of quartiles are within epsilon of each other.
// Returns true if they match within the tolerance, false otherwise.
func compareQuartiles(q1a, meda, q3a, q1b, medb, q3b, epsilon float64) bool { … }
// checkQuartiles is a test helper that compares sampler output against the reference
// implementation and reports any differences.
func checkQuartiles(t *testing.T, data []float64, epsilon float64) {
t.Helper()
// Get reference values
wantQ1, wantMed, wantQ3 := referenceQuartiles(data)
// Get sampler values using a large reservoir for accuracy
qs := NewQuartileSampler(1000)
for _, v := range data {
qs.Add(v)
}
gotQ1, gotMed, gotQ3 := qs.Quartiles()
if !compareQuartiles(gotQ1, gotMed, gotQ3, wantQ1, wantMed, wantQ3, epsilon) {
t.Errorf("Quartiles mismatch:\ngot (q1=%v, med=%v, q3=%v)\nwant (q1=%v, med=%v, q3=%v)\nepsilon=%v",
gotQ1, gotMed, gotQ3, wantQ1, wantMed, wantQ3, epsilon)
}
}
The original test from above has been reworked to use checkQuartiles, and we have something new:
func FuzzQuartileSampler(f *testing.F) {
// Add some seed corpus
f.Add([]float64{1, 2, 3, 4, 5})
f.Fuzz(func(t *testing.T, data []float64) {
// Use a larger epsilon for fuzzing since we might get more extreme values
checkQuartiles(t, data, 0.2)
})
}
This is fun because it’s wrong. My running gopls
tool immediately says:
fuzzing arguments can only have the following types:
string, bool, float32, float64,
int, int8, int16, int32, int64,
uint, uint8, uint16, uint32, uint64,
[]byte
Pasting that error back into the LLM gets it to regenerate the fuzz test such that it is built around a func(t *testing.T, data []byte)
function that uses math.Float64frombits
to extract floats from the data slice. Interactions like this point us toward automating the feedback from tools; all it needed was the obvious error message to make solid progress toward something useful. I was not needed.
Doing a quick survey of the last few weeks of my LLM chat history shows (which, as I mentioned earlier, is not a proper quantitative analysis by any measure) that more than 80 percent of the time there is a tooling error, the LLM can make useful progress without me adding any insight. About half the time, it can completely resolve the issue without me saying anything of note. I am just acting as the messenger.