PySpark: great-circle distance

2021, Mar 23    

Suppose we want to compute the distance between millions of GPS points (lat, lon) with PySpark.
We could write a PySpark UDF that wraps a distance function of a python library like geopy. The problem with UDFs or vectorized UDFs is that they suffer from high serialization and invocation overhead thus they are SLOW (they are written in python)!
We shouldn’t use UDFs unless we have no other choice!

Here you can find a pure PySpark implementation of the great-circle distance (in km) between two GPS points p1 and p2

import pyspark.sql.functions as F

def distance(lat_p1, lon_p1, lat_p2, lon_p2):
    '''
    Calculates the great-circle distance (in km) between two 
    GPS points p1 and p2
    https://en.wikipedia.org/wiki/Great-circle_distance#Formulae
    -------------------------------------
    :param lat_p1: latitude of origin point
    :param lon_p1: longitude of origin point
    :param lat_p1: latitude of destination point
    :param lon_p1: longitude of destination point
    :returns: distance in km
    '''
    return F.acos(
        F.sin(F.toRadians(lat_p1)) * F.sin(F.toRadians(lat_p2)) + 
        F.cos(F.toRadians(lat_p1)) * F.cos(F.toRadians(lat_p2)) * 
            F.cos(F.toRadians(lon_p1) - F.toRadians(lon_p2))
    ) * F.lit(6371.0)