# PySpark: great-circle distance

2021, Mar 23

Suppose we want to compute the distance between millions of GPS points (lat, lon) with PySpark.

We could write a PySpark UDF that wraps a distance function of a python library like geopy. The problem with UDFs or vectorized UDFs is that they suffer from high serialization and invocation overhead thus they are SLOW (they are written in python)!

**We shouldnâ€™t use UDFs unless we have no other choice!**

Here you can find a pure PySpark implementation of the great-circle distance (in km) between two GPS points *p1* and *p2*

```
import pyspark.sql.functions as F
def distance(lat_p1, lon_p1, lat_p2, lon_p2):
'''
Calculates the great-circle distance (in km) between two
GPS points p1 and p2
https://en.wikipedia.org/wiki/Great-circle_distance#Formulae
-------------------------------------
:param lat_p1: latitude of origin point
:param lon_p1: longitude of origin point
:param lat_p1: latitude of destination point
:param lon_p1: longitude of destination point
:returns: distance in km
'''
return F.acos(
F.sin(F.toRadians(lat_p1)) * F.sin(F.toRadians(lat_p2)) +
F.cos(F.toRadians(lat_p1)) * F.cos(F.toRadians(lat_p2)) *
F.cos(F.toRadians(lon_p1) - F.toRadians(lon_p2))
) * F.lit(6371.0)
```