A Simple Heuristic for Random Error in Possibly Google-able Coordinates

Nov. 21, 2021

For my project with Oliver Kim mapping income in China during the 1960s, we had to decide whether to trust the coordinates provided with some of our village-level ground truth data. We figured villages are pretty sticky in their location over time, so it might make sense to just Google them and use those coordinates. A first pass showed that this isn't so simple: there are lots of villages in China with the same or similar-enough names to foil a naive approach. After supplying additional information, such as the province, we got results that appeared reasonable, but we remained wary.

To decide whether to stick with the provided coordinates or switch to the Google-derived ones, I devised a simple heuristic based on a test whose null hypothesis is that the Google coordinates are correct and the provided coordinates are noisy in a random direction (one that is independent of the displacement distance). The heuristic statistic is

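$T=\sum\limits_{i=1}^N \left[\mathbf{1}\{\textrm{ProvidedCoords}_i \notin \textrm{County}_i\} - \left(\textrm{share of the circumference of the circle centered at GoogleCoords}_i\textrm{ with radius }d(\textrm{ProvidedCoords}_i, \textrm{GoogleCoords}_i)\textrm{ that is} \notin \textrm{County}_i\right)\right]$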

and when it is close to zero, we fail to reject the null. The intuition is as follows: if the provided coordinates are noisy in a random direction, then, conditional on how far they are displaced, they should fall outside a given boundary as often (in expectation) as a point placed at the same distance in a uniformly random direction. That latter frequency is the share of the circle centered on the Google coordinates, with radius equal to the distance between the two sets of coordinates, that lies outside the boundary. I illustrate the test below using Chinese provinces as the boundaries.
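
The intuition can be written out as a short derivation. This is only a sketch under assumptions of my own notation: coordinates are treated as planar, $G_i$ denotes the Google coordinates, $P_i$ the provided coordinates, $C_i$ the boundary, $r_i = d(P_i, G_i)$ the displacement distance, and $s_i$ the share of the circle's circumference outside $C_i$. Under the null, $P_i = G_i + r_i(\cos\theta_i, \sin\theta_i)$ with $\theta_i$ uniform on $[0, 2\pi)$ and independent of $r_i$, so

$E\left[\mathbf{1}\{P_i \notin C_i\} \mid r_i\right] = \frac{1}{2\pi}\int_0^{2\pi} \mathbf{1}\{G_i + r_i(\cos\theta, \sin\theta) \notin C_i\}\,d\theta = s_i$

and therefore each term of the sum has mean zero, giving $E[T] = 0$. If, in addition, the noise is independent across villages, each indicator is a Bernoulli draw with success probability $s_i$, so

$Var(T \mid r_1, \dots, r_N) = \sum\limits_{i=1}^N s_i (1 - s_i)$

which gives a rough sense of scale for judging whether an observed $T$ is "close to zero".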

Now I'll make some fake Google coordinates of the villages, which will be the true coordinates under the null hypothesis.
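
A minimal sketch of this step, assuming a hypothetical rectangular region in place of a real province polygon (the names `REGION` and `sample_true_coords` are illustrative, not from the project). It draws points uniformly inside the region and treats them as the "Google" coordinates.

```python
import random

# Hypothetical stand-in for a province boundary: an axis-aligned box
# given as (min_x, min_y, max_x, max_y) in arbitrary planar units.
REGION = (0.0, 0.0, 10.0, 6.0)


def sample_true_coords(n, region, seed=0):
    """Draw n points uniformly at random inside the rectangular region."""
    rng = random.Random(seed)
    min_x, min_y, max_x, max_y = region
    return [
        (rng.uniform(min_x, max_x), rng.uniform(min_y, max_y))
        for _ in range(n)
    ]


# Fake "Google" coordinates: the true village locations under the null.
google_coords = sample_true_coords(500, REGION)
```

With real data, the box would be replaced by the actual boundary polygon and a point-in-polygon check.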

Next, we have to make the noisy coordinates, which will be displaced by a random direction and distance.
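
One way to sketch the displacement, assuming planar coordinates and a uniform distance distribution (both are illustrative assumptions, as are the name `displace` and the `max_dist` parameter). The key feature is that the direction is drawn uniformly on the circle and independently of the distance, which is what the null hypothesis requires.

```python
import math
import random


def displace(points, max_dist, seed=1):
    """Move each point by a random distance in a uniformly random direction."""
    rng = random.Random(seed)
    noisy = []
    for x, y in points:
        theta = rng.uniform(0.0, 2.0 * math.pi)  # direction, uniform on the circle
        dist = rng.uniform(0.0, max_dist)        # distance, drawn independently of theta
        noisy.append((x + dist * math.cos(theta), y + dist * math.sin(theta)))
    return noisy


# A few example "true" locations and their noisy "provided" counterparts.
true_pts = [(1.0, 1.0), (5.0, 3.0), (9.5, 5.5)]
provided_coords = displace(true_pts, max_dist=2.0)
```

Any other distance distribution would work equally well here, since the heuristic conditions on the realized distance.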

Now we need to make the test statistic.
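
The following self-contained sketch computes the statistic on simulated data. It assumes a single hypothetical rectangular region shared by all villages (in the real application each village has its own boundary) and approximates the share of the circumference outside the region by checking evenly spaced angles. All function names are illustrative.

```python
import math
import random

REGION = (0.0, 0.0, 10.0, 6.0)  # hypothetical box: (min_x, min_y, max_x, max_y)


def outside(pt, region):
    """True if the point lies outside the rectangular region."""
    x, y = pt
    min_x, min_y, max_x, max_y = region
    return not (min_x <= x <= max_x and min_y <= y <= max_y)


def share_outside(center, radius, region, n_angles=720):
    """Approximate share of the circle's circumference outside the region."""
    cx, cy = center
    count = 0
    for k in range(n_angles):
        theta = 2.0 * math.pi * k / n_angles
        pt = (cx + radius * math.cos(theta), cy + radius * math.sin(theta))
        if outside(pt, region):
            count += 1
    return count / n_angles


def heuristic_statistic(provided, google, region):
    """Sum over villages of (provided outside?) minus (share of circle outside)."""
    total = 0.0
    for p, g in zip(provided, google):
        radius = math.hypot(p[0] - g[0], p[1] - g[1])
        total += float(outside(p, region)) - share_outside(g, radius, region)
    return total


# Simulate data under the null: true points plus noise in a random direction.
rng = random.Random(0)
google = [(rng.uniform(0.0, 10.0), rng.uniform(0.0, 6.0)) for _ in range(2000)]
provided = []
for gx, gy in google:
    theta = rng.uniform(0.0, 2.0 * math.pi)
    dist = rng.uniform(0.0, 3.0)
    provided.append((gx + dist * math.cos(theta), gy + dist * math.sin(theta)))

T = heuristic_statistic(provided, google, REGION)
print("T =", T, "| T / N =", T / len(google))
```

Dividing by the number of villages makes the value easier to compare across samples of different sizes.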

As you can see, this is quite close to 0, which is consistent with the Google coordinates being correct. When we applied this test to our real data, we also got a test statistic very close to zero, which helped us decide to just use the Google coordinates.