A quest to display all football clubs of the world on a map
Aggregating publicly available club data and displaying it on Google Maps with an adaptive zoom based on power rankings.

As a big football fan, I've always loved meeting new people around the world who share my passion for the game. While staying abroad on vacation, I wondered which clubs the local football enthusiasts supported. After talking to many of them, it seemed that they all supported the club of their capital city, while also being very proud of their small local club. Then I wondered: what if we could raise awareness about all these smaller clubs, and give them the spotlight they deserve? Thus began the quest to place all the professional football clubs on the world map.

The idea was to first semi-automatically collect publicly available club data through a variety of online sources, and then to display their logo and their name on Google Maps. To choose which clubs to display at a specific zoom level, I would take into account the relative strength of each club, using the power ranking of the club within its confederation.

After all these steps, the final result can be seen by clicking the big button "Open Club World Map" below:

Open Club World Map

In case you cannot access the map, here are a few screenshots taken from different places on the planet:
Western Europe
London, UK

Even though the final result looks simple and clean, the data collection process turned out to be more complex than expected because of the sparsity of the sources and the inconsistencies of the formats.

The next section summarizes the data collection process that was built to address these issues, while proposing insights about the main challenges that were encountered.

Collect the club metadata

Since I started from scratch, I did not have any initial database of clubs. There were a few dumps of football data created by the analytics community online, but none of them contained all the relevant fields: club name, geographical location of the stadium, and logo. For this reason, I decided to use, as a first pass, the publicly available club metadata taken directly from the HTML pages of a commercial website: Soccerway.com.

To achieve this, I parsed the HTML web pages using BeautifulSoup. As developing the parsing functions involved a lot of trial and error, I also used the CachedSession wrapper of the requests_cache package, which makes sure the HTTP client actually hits the website only on cache misses. Even though the CachedSession class was not advertised on the main page of the package, instantiating it explicitly instead of using a global requests_cache.install_cache(cache_name) call made the caching logic easier to reuse across all my scripts.

# Before
import requests
import requests_cache
# globally augment the requests API with a cache
requests_cache.install_cache("my_cache")
response = requests.get("http://pwal.ch/")

# After
from requests_cache import CachedSession
# locally instantiate a requests API with a cache
requests_session = CachedSession(SOCCERWAY_LEAGUE_CACHE_NAME)
response = requests_session.get("http://pwal.ch")

After looking at the structure of the website, I identified three layers of data: countries, leagues/competitions and teams/clubs. That is, the root page of the teams was at the level of countries, which led to all leagues of each country, which in turn led to teams of each competition. Since I knew I would need the confederation later to rank the clubs, I manually labeled the confederation for each country:

{
  "...": "...",
  "Kenya": "CAF",
  "Korea Republic": "AFC",
  "Kosovo": "UEFA",
  "Kuwait": "AFC",
  "Kyrgyzstan": "AFC",
  "...": "..."
}
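
The three-layer traversal described above could be sketched roughly as follows. This is only an illustration: `list_countries`, `list_leagues` and `list_teams` are hypothetical callables standing in for the real page-parsing functions, which are not shown in this post, and `confederation_by_country` is the manually labeled mapping.

```python
from typing import Callable, Dict, Iterable, List


def crawl_teams(
    list_countries: Callable[[], Iterable[str]],
    list_leagues: Callable[[str], Iterable[str]],
    list_teams: Callable[[str, str], Iterable[dict]],
    confederation_by_country: Dict[str, str],
) -> List[dict]:
    """Walk countries -> leagues -> teams and tag each team with its origin."""
    teams = []
    for country in list_countries():
        # None when a country has not been labeled yet
        confederation = confederation_by_country.get(country)
        for league in list_leagues(country):
            for team in list_teams(country, league):
                tagged = dict(team)  # copy, so the parser output is untouched
                tagged["_country_name"] = country
                tagged["_league_name"] = league
                tagged["_confederation_name"] = confederation
                teams.append(tagged)
    return teams
```

Passing the three layers as functions keeps the traversal independent of the HTML parsing, so each parser can be developed and tested separately.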

After downloading all the competitions for all countries, I manually filtered them to include only the professional leagues. Indeed, the end goal was to display a logo and a geographical location, which might not be available very often for lower level clubs.

Finally, I downloaded the metadata of each team of each league, resulting in a JSON file with the full list of 4521 teams. Below is an example of complete metadata for Olympiakos in Greece.

{
  "_confederation_name": "UEFA",
  "_country_name": "Greece",
  "_id": 1040,
  "_league_name": "Super League",
  "address": [
    "Plateia Alexandras",
    "18534",
    "Pireás (Piraeus)"
  ],
  "country": "Greece",
  "founded": "1925",
  "logo": "http://cache.images.core.optasports.com/soccer/teams/150x150/1040.png",
  "long_name": "Olympiakos CFP",
  "name": "Olympiakos Piraeus",
  "venue_capacity": "33296",
  "venue_city": "Pireás (Piraeus)",
  "venue_name": "Stadio Georgios Karaiskáki"
}

Most of the fields were obvious, but a few had surprising data. For example, the given address was often the address of the headquarters, which could be in a different city from the stadium. Also, surprisingly, a logo file was available for every club, named after the ID of the club. Unfortunately, a lot of these logo files were actually identical and consisted of a gray/white image with the text "logo coming soon". These problems are addressed in the following sections.

Collect the logos

Identify the missing logos

Placeholder

For many teams, the logo image file was directly available from Soccerway. However, many of the logo files turned out to be placeholders rather than actual logos. These placeholders were all the exact same image file, with the "logo coming soon" text shown on the right-hand side.

To identify all these placeholder images and find for which clubs I had a genuine logo, I computed the MD5 hash of the placeholder and computed the list of logos that matched this hash:

import re
from hashlib import md5

logoless_id_list = []
for logo_path in logo_path_list:
    with open(logo_path, "rb") as logo_file:
        logo_data = logo_file.read()

    md5_sum = md5(logo_data).hexdigest()
    if md5_sum != OFFICIAL_LOGO_SOON_HASH:
        continue

    matches = re.search(r"(\d+)\.png", logo_path)
    if not matches:
        continue

    soccerway_id = int(matches.group(1))
    logoless_id_list.append(soccerway_id)

This process resulted in a list of logoless teams, for which a logo would be obtained later from somewhere else.
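
The post does not say how the reference hash `OFFICIAL_LOGO_SOON_HASH` was obtained. One possible way, assuming the placeholder is the most frequent image content in the download folder, is to count the hashes; `most_common_hash` is a hypothetical helper written for this sketch.

```python
from collections import Counter
from hashlib import md5


def most_common_hash(image_bytes_list):
    """Return (md5_hex, count) for the most frequent image content.

    Raises IndexError if the list is empty.
    """
    counts = Counter(md5(data).hexdigest() for data in image_bytes_list)
    return counts.most_common(1)[0]
```

If one hash appears thousands of times while all others appear once, it is very likely the placeholder, but it is still worth opening one of the matching files to check.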

Find the missing logos from another source

Olympiacos Wiki page infobox

While googling for logos, I realized that most of them were available on Wikipedia. For this reason, I decided to search for the 2231 logoless teams using the MediaWiki search API, hoping their wiki page would contain the logo. For example, here is an infobox for Olympiakos FC on Wikipedia.

Fortunately for me, someone implemented the wikipedia package for Python, which abstracts away the MediaWiki API.

Even though the logo was often present in the Wiki page, its location within the page was inconsistent. To handle as many cases as possible, I took the first image I could find in the first few fields of the info box displayed on the right-hand side. Indeed, the logo was often somewhere at the top of this box.
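
The "first image in the first few rows of the infobox" heuristic could be sketched as below. This is not the code used for the project: it works on raw page HTML with the standard library only, assumes the infobox is a table whose class list contains `infobox`, and `max_rows` is an arbitrary illustrative limit.

```python
from html.parser import HTMLParser


class InfoboxImageFinder(HTMLParser):
    """Find the first <img> inside the first rows of the first infobox table."""

    def __init__(self, max_rows=3):
        super().__init__()
        self.max_rows = max_rows
        self.table_depth = 0    # 0 = outside the infobox, 1 = directly inside it
        self.row_count = 0      # top-level infobox rows opened so far
        self.finished = False   # set once the first infobox has been closed
        self.image_src = None

    def handle_starttag(self, tag, attrs):
        if self.finished:
            return
        attrs = dict(attrs)
        if tag == "table":
            if self.table_depth > 0:
                self.table_depth += 1  # nested table inside the infobox
            elif "infobox" in (attrs.get("class") or "").split():
                self.table_depth = 1
        elif tag == "tr" and self.table_depth == 1:
            self.row_count += 1
        elif (tag == "img" and self.table_depth > 0
              and self.row_count <= self.max_rows
              and self.image_src is None):
            self.image_src = attrs.get("src")

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1
            if self.table_depth == 0:
                self.finished = True


def first_infobox_image(html, max_rows=3):
    """Return the src of the first infobox image, or None if there is none."""
    finder = InfoboxImageFinder(max_rows)
    finder.feed(html)
    return finder.image_src
```

A heuristic like this cannot tell a logo from a kit picture or a city photo, which is why a manual check remains necessary afterwards.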

Thanks to this process, an image file was found for 1277 clubs. Unfortunately, many of them were actually not the logos I was looking for. Here are a few examples of the cases that happened:

  • when searching for a not-so-popular club, the page of the city where the club is located would be found instead of the page of the club itself;
  • some wiki pages did not feature a logo, but the uniform of the team;
  • the logo was not in the first few lines of the infobox.

To correct the mistakes, I renamed all the downloaded image files to include the name of the club, and used the image preview tool of my OS to see whether the image really corresponded to a logo. If the image did not correspond, I tried to find the correct one by googling the team name, and then replacing the incorrect file.

As these images did not always have a consistent format, I applied several image transformations with ImageMagick to enforce one:

# ensure same image format:
mogrify -format png *.gif *.jpeg *.jpg *.JPG *.PNG

# make white background transparent:
convert image-with-background.png -bordercolor white -border 1x1 -alpha set -channel RGBA -fuzz 1% -fill none -floodfill +0+0 white -shave 1x1 image-without-background.png

# crop to non-transparent area (+repage resets the canvas to the cropped size)
convert input.png -trim +repage cropped.png

# resize to consistent height
convert input.png -resize x150 resized.png

Following these correction steps, I had a fairly good coverage of the team logos, with 2290 teams.

Locating the teams on the world map

In order to display all the teams on Google's world map, I needed the latitude and longitude coordinates of each location. In case you are not familiar with this coordinate system, it is simply a way to specify a location on Earth. Below is an image explaining that, picked from Wikimedia Commons and created by Djexplo:

Latitude and longitude on the globe

These coordinates were not available in the team metadata, but could be derived from its location-related fields.

In the team metadata, there were three relevant fields for the location: the registered address of the club, the venue name and the venue city. I first made sure there was at least a city for each team by manually correcting the entries where both the address and the venue city were missing.

Then, I used Google's geocoding API to transform the address into latitude and longitude coordinates to be displayed on the map. My geocoding process was the following:

  • if stadium information was available
    • try "stadium_name city_name country"
    • if the above failed, fall back to "city country"
  • if not, try "city country"

Thanks to this process, the geocoding API could almost always find the coordinates.
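
The fallback logic above could look roughly like this. It is a sketch under assumptions: `geocode` is a hypothetical callable that wraps the geocoding service and returns a location dictionary or `None`, and the field names are those of the team metadata shown earlier.

```python
def build_geocoding_queries(team):
    """Return the query strings to try, most specific first."""
    stadium = team.get("venue_name") or ""
    city = team.get("venue_city") or ""
    country = team.get("country") or ""

    queries = []
    if stadium:
        # "stadium_name city_name country"
        queries.append(" ".join(part for part in (stadium, city, country) if part))
    # "city country", used directly or as a fallback
    queries.append(" ".join(part for part in (city, country) if part))
    return queries


def geocode_team(team, geocode):
    """Try each query in turn and return the first result that is not None."""
    for query in build_geocoding_queries(team):
        location = geocode(query)
        if location is not None:
            return location
    return None
```

Separating the query construction from the API call makes the fallback order easy to test without any network access.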

However, when several teams shared the same location, their coordinates were identical and their markers would later be drawn at exactly the same spot. To prevent this from happening, I assumed that the Earth is flat at the scale of a few kilometers, and distributed the teams evenly on a circle with a 50-meter radius using polar coordinates:

from math import cos, pi, sin


def separate_clustered_geolocations(clustered_geolocation_list):
    separated_geocoding_list = []
    for group in clustered_geolocation_list:
        group_size = len(group)

        if group_size == 1:
            separated_geocoding_list.append(group[0])
            continue

        for index, geocoding in enumerate(group):
            angle = index / float(group_size) * 2.0 * pi
            original_location = geocoding["location"]

            geocoding_copy = geocoding.copy()
            geocoding_copy["location"] = shift_location(original_location, angle)

            separated_geocoding_list.append(geocoding_copy)

    return separated_geocoding_list


def shift_location(original_location, angle):
    return {
        "longitude": original_location["longitude"] + RADIUS_LAT_LNG_DEGREES * cos(angle),
        "latitude": original_location["latitude"] + RADIUS_LAT_LNG_DEGREES * sin(angle)
    }
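
The function above expects geocodings that are already grouped by location, a step that is not shown here. One simple way to build these groups, assuming that teams sharing a stadium received exactly the same coordinates from the geocoder, is a dictionary keyed by the coordinate pair; `cluster_by_location` is a hypothetical name.

```python
from collections import defaultdict


def cluster_by_location(geocoding_list):
    """Group geocodings whose coordinates are exactly identical."""
    groups = defaultdict(list)
    for geocoding in geocoding_list:
        location = geocoding["location"]
        key = (location["latitude"], location["longitude"])
        groups[key].append(geocoding)
    # one list per distinct location, in order of first appearance
    return list(groups.values())
```

Exact float comparison is acceptable here only because identical queries return identical values; locations that are merely close together would need a distance threshold instead.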

After this process, teams sharing the same stadium no longer overlap and are displayed on a circle instead. Here are two examples of several clubs occupying the same stadium.

San Siro
Roma, Italy
Stade de Magenta
Noumea, New Caledonia

You can see that the formula based on polar coordinates and the flat-Earth assumption work well: the clubs are arranged in a nice circle.
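
The value of `RADIUS_LAT_LNG_DEGREES` is not given in this post. As a rough worked example, a distance in meters can be converted to degrees by assuming about 111 km per degree of latitude. The sketch below also shows that, in meters, one degree of longitude shrinks with the cosine of the latitude, which a single shared constant ignores; `meters_to_degrees` is a hypothetical helper.

```python
from math import cos, radians

# Approximation: 40,075 km equatorial circumference / 360 degrees
METERS_PER_DEGREE = 111_320.0


def meters_to_degrees(meters, latitude_degrees=0.0):
    """Return (latitude_offset, longitude_offset) in degrees for a distance."""
    lat_offset = meters / METERS_PER_DEGREE
    # a degree of longitude covers less ground away from the equator
    lng_offset = lat_offset / cos(radians(latitude_degrees))
    return lat_offset, lng_offset
```

With these assumptions, 50 meters corresponds to roughly 0.00045 degrees of latitude.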

Rank the teams

As the map would have become totally unreadable if all the clubs were displayed at once, I decided to display at most 30 clubs at a time. As there can be many clubs in view at a given zoom level, objectively choosing the 30 most interesting ones is not obvious.

I decided to avoid re-inventing the wheel and to take advantage of existing club rankings such as the Euro Club Index. For each confederation, I extracted a team name and a number of points, and matched the name to my team metadata semi-automatically using a similarity score:

from difflib import SequenceMatcher as SM

def find_similar_soccerway_team(euro_club_name, soccerway_country_team_list):
    best_corresponding_team = None
    best_similarity_score = 0
    for soccerway_team in soccerway_country_team_list:
        soccerway_team_name = soccerway_team["long_name"]

        score = SM(None, euro_club_name, soccerway_team_name).ratio()

        if score >= best_similarity_score:
            best_corresponding_team = soccerway_team
            best_similarity_score = score

    if best_corresponding_team is None:
        raise RuntimeError(
            "Soccerway country team list is empty for index club: {}".format(
                euro_club_name))

    return best_corresponding_team, best_similarity_score
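
The "semi-automatic" part means that low-similarity matches still need a human check. A possible way to separate the confident matches from those to review is sketched below; the 0.8 threshold is an arbitrary assumption, not a value used in the project, and `split_matches_for_review` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def split_matches_for_review(candidate_pairs, threshold=0.8):
    """Split (index_name, soccerway_name) pairs into accepted and to-review."""
    accepted, to_review = [], []
    for index_name, soccerway_name in candidate_pairs:
        score = SequenceMatcher(None, index_name, soccerway_name).ratio()
        target = accepted if score >= threshold else to_review
        target.append((index_name, soccerway_name, score))
    return accepted, to_review
```

For instance, "Real Madrid" against "Real Madrid CF" shares 11 of 25 characters on each side, giving a ratio of 2 × 11 / 25 = 0.88, so it would be accepted automatically with this threshold.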

For reference, the rankings for each confederation have been extracted from the following websites:

For OFC, as there was no available club ranking, I computed an ad-hoc score based on the results of the OFC Champions League for the last three seasons.

After matching each team of the club indices to my team metadata, the scores of the different confederations were still on different scales. For this reason, I rescaled them between 0 and 1, and then weighted them according to the number of spots their confederation gets in the World Cup. Finally, I scaled them again between 0 and 1000 to get a human-readable score.
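
The exact formula is not spelled out here, so the following is only one plausible reading: min-max normalization within a confederation, multiplied by the confederation's share of World Cup spots, then by 1000. The slot counts in `WORLD_CUP_SLOTS` are an assumption made for this sketch. With them, the best UEFA club gets 1000 × 13 / 31 ≈ 419.35, which is consistent with the top score listed in this post.

```python
# Assumed number of World Cup spots per confederation (illustrative weights)
WORLD_CUP_SLOTS = {
    "UEFA": 13, "CAF": 5, "AFC": 4.5,
    "CONMEBOL": 4.5, "CONCACAF": 3.5, "OFC": 0.5,
}


def rescale_scores(points_by_team, confederation, slots=WORLD_CUP_SLOTS):
    """Min-max normalize index points, weight by confederation, scale to 0-1000."""
    low = min(points_by_team.values())
    high = max(points_by_team.values())
    span = (high - low) or 1.0  # avoid dividing by zero when all points are equal
    weight = slots[confederation] / sum(slots.values())
    return {
        team: 1000.0 * weight * (points - low) / span
        for team, points in points_by_team.items()
    }
```

Weighting by World Cup spots is a crude proxy for the relative strength of the confederations, but it keeps clubs from different continents comparable on a single scale.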

As an example, here are the top 5 teams of the ranking:

| Club name | Country | Points |
|---|---|---|
| Real Madrid CF | Spain | 419.35 |
| FC Barcelona | Spain | 416.02 |
| FC Bayern München | Germany | 389.07 |
| Paris Saint-Germain FC | France | 375.02 |
| Club Atlético de Madrid | Spain | 374.85 |

Aggregate all the information from all sources

After collecting information from all these different sources, the last step was to aggregate them all in one place and to make them usable for the map web-app. For this, I simply joined all the sources and sorted the clubs by score, while prioritizing the teams that had a logo. For example, here is the first entry:

{
    "address_lines": [
        "Avenida de Concha Espina 1, Chamart\u00edn",
        "28036",
        "Madrid"
    ],
    "confederation": "UEFA",
    "country": "Spain",
    "foundation_date": "1902",
    "has_logo": true,
    "id": 2016,
    "index_name": "Real Madrid",
    "index_points": 4413.27802,
    "league": "Primera Divisi\u00f3n",
    "location": {
        "latitude": 40.4530541,
        "longitude": -3.6883445
    },
    "name": "Real Madrid CF",
    "score": 419.35483870967744,
    "venue_capacity": "85454",
    "venue_city": "Madrid",
    "venue_name": "Estadio Santiago Bernab\u00e9u"
}
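
The join and the sort could be sketched as follows, assuming each source is a dictionary (or set) keyed by the team ID. The argument names are hypothetical and the real script is not shown in this post.

```python
def aggregate_and_sort(teams, score_by_id, location_by_id, logo_ids):
    """Join the sources by team ID and sort: logos first, then best score."""
    merged = []
    for team in teams:
        team_id = team["id"]
        entry = dict(team)
        entry["score"] = score_by_id.get(team_id, 0.0)   # unranked teams get 0
        entry["location"] = location_by_id.get(team_id)  # None if not geocoded
        entry["has_logo"] = team_id in logo_ids
        merged.append(entry)
    # False sorts before True, so "not has_logo" puts teams with a logo first
    merged.sort(key=lambda entry: (not entry["has_logo"], -entry["score"]))
    return merged
```

Sorting once offline means the map code only has to walk the list in order.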

Finally, to provide a clean output for the web-app, I applied a webification script that filtered out most of the fields and kept only the essential ones, while also mapping the IDs to an integer between 0 and 4443 (club count). Below is the webified version of the Real Madrid entry:

{
    "country": "Spain",
    "has_logo": true,
    "id": 531,
    "league": "La Liga",
    "location": {
        "lat": 40.4530541,
        "lng": -3.6883445
    },
    "name": "Real Madrid CF",
    "stadium": "Estadio Santiago Bernab\u00e9u"
}
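
A webification step of this kind might look like the sketch below. The list of kept fields follows the example above, but the way the new IDs are assigned (here, by rank of the original ID) is an assumption, and any renaming of league names is left out.

```python
WEB_FIELDS = ("country", "has_logo", "league", "name")


def webify(entries):
    """Keep only the fields the map needs and renumber the IDs from 0."""
    # Assumption: the new IDs follow the order of the original IDs
    new_id_by_old = {
        old_id: new_id
        for new_id, old_id in enumerate(sorted(entry["id"] for entry in entries))
    }
    web_entries = []
    for entry in entries:
        web_entry = {field: entry[field] for field in WEB_FIELDS}
        web_entry["id"] = new_id_by_old[entry["id"]]
        web_entry["stadium"] = entry["venue_name"]
        # shorter keys, matching the lat/lng naming of the Google Maps API
        web_entry["location"] = {
            "lat": entry["location"]["latitude"],
            "lng": entry["location"]["longitude"],
        }
        web_entries.append(web_entry)
    return web_entries
```

Dropping unused fields and shortening keys keeps the JSON file downloaded by the browser small.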

Web-app

After all the steps of the data collection pipeline described in the previous section, developing the web-app was fairly straightforward.

I used Metalsmith, a JavaScript-based static site generator, with a few plugins to simplify development, and the Google Maps API to display the map and the markers.

Since the Python scripts already took care of sorting by presence of logo and score offline, the JavaScript code handling the logic of the map was very simple to implement and fits in fewer than 100 lines of code.
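
The real map logic is written in JavaScript and is not reproduced here. As an illustration of why a pre-sorted list makes it simple, here is a Python sketch of how the clubs to display could be picked for the current map bounds. It assumes the bounds do not cross the antimeridian, and `visible_clubs` is a hypothetical name.

```python
def visible_clubs(sorted_clubs, south, west, north, east, limit=30):
    """Return the first `limit` clubs of a pre-sorted list inside the bounds."""
    if limit <= 0:
        return []
    visible = []
    for club in sorted_clubs:
        lat = club["location"]["lat"]
        lng = club["location"]["lng"]
        if south <= lat <= north and west <= lng <= east:
            visible.append(club)
            if len(visible) == limit:
                break  # the list is sorted, so the best clubs come first
    return visible
```

Because the list is already ordered by logo presence and score, zooming in simply shrinks the bounds and lets lower-ranked local clubs enter the top 30.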

Here is an example of the end result for Buenos Aires, Argentina:

Overview of the map in Buenos Aires

Conclusion & future work

In this blog post, I described a pipeline for football club data collection and a web-app to visualize this data. The data collection pipeline consisted of extracting the team metadata from a big football website, completing it with geolocation information from the Google Maps API and logos from Wikipedia, and finally ranking them according to confederation rankings.

Before starting this project, I knew the data collection pipeline would be harder than the web-app. However, I underestimated how incomplete online data sources were, including popular live score websites. The data collection pipeline took an order of magnitude more development time than the web-app and involved more manual correction steps than expected. Fortunately, the web-app was quite easy thanks to the well-established Google Maps JavaScript tooling.

For the future, an obvious next step would be to improve the data set by adding more logos and geolocations, particularly for clubs playing in lower divisions. Apart from that, one could also collect data for all amateur clubs within a country, and display all of them on the map. In this case, ranking might be easier because there is a well-defined pyramid of divisions within each national federation. Regarding the web-app, live score information could be integrated to provide relevant information about the clubs, such as the current game score, the latest results or statistics.