Extracting football statistics with Perl
Creating a data set from football websites using Perl's text parsing capabilities.
Perl camel (WikiMedia image by Klapi)

To start writing about football statistics, I first needed a clean data set. The problem is that there is no online football stats database with a public API. To obtain a data set, I had to collect the data myself by retrieving it from major football websites.

In this article, I explain how I did it using the Perl programming language, and I share the problems I encountered along the way.

Choose a football website

Before starting to download a full football website with a crawler (like HTTrack), verify that it’s easy to parse:

  • Stats must be directly in the HTML code, not loaded by JavaScript after the fact (see the quick check after this list).
  • The layout must be the same for each HTML match page. Manually verify multiple pages to make sure the structure is consistent.
  • The list of matches must be easy to explore. Avoid JavaScript pagination systems.
  • Compare the statistics of some matches against other websites. You would be very disappointed to discover that your website has wrong stats at the end of a painful extraction process.

Note that in this article, I don’t use an advanced crawler to retrieve the web pages. I just use the Perl programming language.
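
For the first check, you can fetch a match page from the command line and look for a known stat in the raw HTML; if nothing matches, the stat is probably loaded by JavaScript. The URL here follows the example pattern used later in this article:

curl -s http://football.pwal.ch/match/2013-10-25-Arsenal-United | grep "Half-time"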

Retrieve HTML match pages

First, we manually download a page that lists links to matches. Then, we use WWW::Mechanize to extract the links.

use WWW::Mechanize;
use URI::file;

my $mech = WWW::Mechanize->new();
# A locally saved page must be addressed as a file:// URL
$mech->get(URI::file->new_abs("matchListPage.html"));
my @links = $mech->links();
my @urls = map { $_->url } @links;

@urls now contains all the links found on the page. We filter it with grep to reject the links that don’t follow the URL pattern of match pages. For example, if the match page URLs follow the pattern football.pwal.ch/match/2013-10-25-Arsenal-United and we don’t want pwal.ch/something_else_than_a_football_match to appear, we write this:

@urls = grep(/match\//, @urls);

Now we’ve got all the links pointing to matches. To download these match pages, you can use LWP::Simple with the getstore() routine.

use LWP::Simple;   # exports getstore() and is_success()

my $httpStatus = getstore($url, $matchOutputFilename);
if(! is_success($httpStatus)) {
    # Download the page again; individual tries may fail
}

As stated in the code, downloads may fail, so don’t hesitate to call getstore() again for the same match if the first try is not a success. It’s better to make an extra attempt now than to discover later that a match is missing from a list of 380 (the number of matches in a 20-team league).

Also, we can use Time::HiRes to sleep between requests and avoid simulating a DDoS attack on the server we are exploring:

Time::HiRes::sleep(0.5);
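
Putting the two together, here is a minimal retry sketch; the number of tries and the delay are arbitrary choices, and $url and $matchOutputFilename are the variables from above:

use LWP::Simple;    # getstore() and is_success()
use Time::HiRes;

my $maxTries = 3;
for my $try (1 .. $maxTries) {
    my $httpStatus = getstore($url, $matchOutputFilename);
    last if is_success($httpStatus);    # stop retrying once the download works
    warn "Try $try failed for $url (HTTP status $httpStatus)\n";
    Time::HiRes::sleep(0.5);            # stay polite between tries
}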

After having retrieved all the HTML match pages, we verify that the total number of files is correct. This shell command counts the HTML files in a directory recursively:

find directory_to_explore -name "*.html" | wc -l

We have all the matches, so we can continue with data extraction. Don’t start collecting all the remaining pages of the website yet: you don’t know whether the pages you have already downloaded contain correct information. Extraction comes first.

Notes:

  • Once you have verified the data, you will also use the WWW::Mechanize + LWP::Simple combo to retrieve the match list pages themselves, not just the match pages.
  • Think about compressing your data sets to make them smaller and easier to manipulate; a single compressed archive is much faster to copy than 380 raw HTML files (see the example below).
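
For example, with tar (directory_to_explore is the same placeholder as in the find command above):

tar czf matches.tar.gz directory_to_explore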

Transform HTML pages into Perl data structures

To read a file into a string, we use read_file() from the File::Slurp module:

use File::Slurp;
my $matchText = read_file($matchPath);

Then, we create a Perl hash to represent the match entity:

my %match = ();

After this, we can use either an HTML parser (the HTML::Parser module) or regular expressions to extract the data we want and store it in the hash. As the structure of the website I parsed was not very consistent, I chose regular expressions:

# Half-time score
if($matchText =~ m/Half-time<\/dt>\s*(\d+) - (\d+)<\/dd>/) {
    $match{home}{halftimegoals} = $1;
    $match{away}{halftimegoals} = $2;
}

You might encounter Unicode escaping problems when extracting some team or player names. To avoid them, declare at the beginning of your program that your source code is UTF-8:

use utf8;
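
Note that use utf8; only covers the encoding of the source file itself. If the downloaded HTML files are UTF-8 encoded too, their contents must also be decoded when read; with File::Slurp, this can be done through the binmode option:

my $matchText = read_file($matchPath, binmode => ':utf8');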

After we’ve filled our hash with all the data, we have to store it in the file system. We can store it in multiple ways: with Storable (binary), Data::Dumper (Perl data structure strings), or even JSON::PP (JSON objects).
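
The Storable route, for instance, is a minimal sketch like this (it reuses the $matchDatafilePath variable introduced below):

use Storable qw(store retrieve);

store(\%match, $matchDatafilePath);                  # write a binary dump
my %matchBack = %{ retrieve($matchDatafilePath) };   # read it back later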

The simplest method is the Data::Dumper one. We just give the routine our hash, and it makes a string of Perl hash code out of it. However, we have to modify some of Data::Dumper's package variables before using it:

use Data::Dumper;

$Data::Dumper::Purity = 1;   # produce output that evals back correctly, even for nested structures
$Data::Dumper::Terse = 1;    # omit the variable name where possible
my $dataStructureOutputText = Data::Dumper->Dump([\%match], ["match"]);

Even though Data::Dumper is very easy to use, Perl data structure strings cannot be used directly in other languages. With JSON::PP, we can get the corresponding JSON string of a Perl data structure like this:

use JSON::PP;
my $jsonString = encode_json(\%match);

We can also pretty-print the JSON string through a JSON::PP object:

my $json = JSON::PP->new;
my $prettyPrintedMatch = $json->pretty->encode(\%match);

To finish, we use write_file() from File::Slurp to save the data structure to a file:

write_file($matchDatafilePath, $dataStructureOutputText);

Verify your database

To reopen files we wrote with Data::Dumper, we simply use read_file() from File::Slurp and evaluate the string:

my $matchText = read_file($matchFilePath);
my %match = %{ eval $matchText };   # the file contains Perl code for a hash reference

If you chose JSON, the second line of the previous code becomes a decode_json() call:

my $matchText = read_file($matchFilePath);
my %match = %{decode_json($matchText)};

Even though you could actually analyze the data now, don’t start performing requests on your database before verifying it. It often contains errors you did not expect, and they may not come from your side.

Here are some problems I encountered:

  • The website had a bug with the pagination system. The first match of the last page was not shown consistently. Sometimes it appeared, sometimes not.
  • The website showed duplicates in the last two pages.
  • Some webpages of the same season and of the same championship did not have the same amount of data (missing lineup, missing goal minutes, …).
  • The website used abbreviations inconsistently, so I had to establish correspondences between abbreviations and actual full names.

However, there was an even worse problem: some statistics were wrong. I found this out by comparing the means of some stats to the values displayed on other websites. To fix this type of inaccuracy, I added the fifth step below.

Suggestion: merge your database with another one

To make sure your database is correct, you could find another one and compare the results. You could take the best parts of each side and merge them to form a correct set of matches. In my case, one data set did not feature correct shot stats but provided lineups/cards/substitutions, while the other one had some incorrect dates but provided correct shot stats.

To establish the correspondence between the matches, we can use the names of the teams and the date of the match: two matches between the same teams on the same day (almost) never happen. We use grep() to find the pairs:

my @correspondingMatches = grep {
    # use eq for dates stored as strings (e.g. "2013-10-25"); == only fits numeric timestamps
    $secondSetMatch{date} eq $_->{date}
    and $secondSetMatch{home}{name} eq $_->{home}{name}
    and $secondSetMatch{away}{name} eq $_->{away}{name}
} @firstSetMatches;

Normalize the team names of each set before merging them, to avoid problems when forming the pairs (see the sketch below). Also think about rare cases like cancelled matches recorded as 3-0 wins.
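
Here is a minimal sketch of such a normalization; the mapping entries are hypothetical examples, not taken from any real data set:

# Hypothetical map from the second set's names to the first set's names
my %teamNameMap = (
    'Man Utd'    => 'Manchester United',
    'Arsenal FC' => 'Arsenal',
);

for my $side ('home', 'away') {
    my $name = $secondSetMatch{$side}{name};
    # keep the original name when no mapping entry exists
    $secondSetMatch{$side}{name} = $teamNameMap{$name} // $name;
}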

Finally, after having done all these steps (including compression), you are ready for the most interesting one: analysis.

Analyze your database

The first step here is to write routines that identify the type of data you’re manipulating. Establish match categories to separate the games with all the stats (cards, possession, …) from the games with only the scoreline; a sketch follows below. If you don’t do this, you might get unpleasant surprises when performing requests. For example, you might not notice that you are trying to extract the minute of a player’s red card from a file containing only a score line.
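
As an illustration, here is a possible categorization routine; the stat keys (cards, possession) are assumptions about the %match structure built earlier, so adapt them to your own data:

# Return a coarse category for a match hashref
sub match_category {
    my ($match) = @_;
    if (exists $match->{home}{cards} and exists $match->{home}{possession}) {
        return 'full_stats';   # lineups, cards, possession, ... all available
    }
    return 'score_only';       # nothing beyond the final score
}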

With this last verification step, you should be able to analyze your database “safely”.

Have fun!