Extracting football statistics with Perl
Creating a data set from football websites using Perl's text parsing capabilities.
Perl camel (WikiMedia image by Klapi)

To start writing about football statistics, I first needed a clean data set. The problem is that there is no football stats database available online with a public API. To build the data set, I had to collect the data myself by scraping big football websites.

In this article, I explain how I did it with the Perl programming language, and I share the problems I encountered along the way.

Choose a football website

Before starting to download a full football website with a crawler (like HTTrack), verify that it’s easy to parse:

  • Stats must appear directly in the HTML source, not be loaded by JavaScript after the fact.
  • The layout must be the same for each HTML match page. Perform manual verification on multiple pages to make sure the structure is consistent.
  • The list of matches must be easy to explore. Avoid Javascript pagination systems.
  • Compare the statistics of some matches against other websites. You would be very disappointed to discover your website has wrong stats after the end of a painful extraction process.

Note that in this article, I don’t use an advanced crawler to retrieve the web pages; I just use Perl.

Retrieve HTML match pages

First, we load a page that lists links to matches. Then, we use WWW::Mechanize to extract the links:

my $mech = WWW::Mechanize->new();
$mech->get($matchListUrl);    # the page listing the matches (can be a file:// URL)
my @links = $mech->links();
my @urls = map { $_->url } @links;

@urls contains all the links on the page. We filter it with grep to reject the links that don’t follow the URL pattern of match pages. For example, if the match page URLs follow the pattern football.pwal.ch/match/2013-10-25-Arsenal-United and we don’t want pwal.ch/something_else_than_a_football_match to appear, we write this:

@urls = grep(/match\//, @urls);

Now, we’ve got all the links pointing to matches. To download these match pages, you can use LWP::Simple with the getstore() routine.

my $httpStatus = getstore($url, $matchOutputFilename);
if(! is_success($httpStatus)) {
    # Retry the download; transient failures happen
}

As noted in the code, downloads may fail, so don’t hesitate to call getstore() again for the same match if the first try is not a success. It’s better to make an extra attempt now than to discover later that a match is missing from a list of 380 (the number of matches in a 20-team championship).
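This retry logic can be sketched as a small helper (the routine name, parameters, and retry count are assumptions, not part of the original code):

use LWP::Simple qw(getstore is_success);

# Hypothetical helper: retry a download a few times before giving up
sub download_with_retries {
    my ($url, $outputFile, $maxTries) = @_;
    for my $try (1 .. $maxTries) {
        my $status = getstore($url, $outputFile);
        return 1 if is_success($status);
        warn "Try $try failed for $url (HTTP $status)\n";
    }
    return 0;    # every try failed
}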

Also, we can use Time::HiRes to sleep between requests and avoid simulating a DDoS attack on the server we are exploring.
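A small sketch of such a pause, assuming a random delay of roughly a second is acceptable:

use Time::HiRes qw(usleep);

# Sleep between 0.5 and 1.5 seconds before the next request
my $delayMicroseconds = 500_000 + int(rand(1_000_000));
usleep($delayMicroseconds);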


After retrieving all the HTML match pages, we verify that the total number of files is correct. This shell command counts the HTML files in a directory recursively:

find directory_to_explore -name "*.html" | wc -l

We have all the matches, so we can continue with data extraction. Don’t start collecting every page of every championship yet; you don’t know whether the pages you downloaded contain correct information. Extraction comes first.


  • Once the data has been verified, you can reuse the WWW::Mechanize + LWP::Simple combo to retrieve the match list pages as well, not just the match pages.
  • Think about compressing your data sets to make them smaller and easier to manipulate. Copying one compressed file is much faster than moving 380 raw HTML files.
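As a sketch, the core Archive::Tar module can do this compression from Perl directly (the matches/ directory and archive names are assumptions):

use Archive::Tar;

# Bundle all downloaded pages into one compressed archive
my $tar = Archive::Tar->new();
$tar->add_files(glob 'matches/*.html');
$tar->write('matches.tar.gz', COMPRESS_GZIP);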

Transform HTML pages into Perl data structures

To extract a string from a text file, we use read_file() from the File::Slurp module:

my $matchText = read_file($matchPath);

Then, we create a Perl hash to represent the match entity:

my %match = ();

After this, we can use either an HTML parser (the HTML::Parser module) or regular expressions to extract the data we want and store it in the hash. As the structure of the website I parsed was not very consistent, I chose regular expressions:

# Half-time score
if($matchText =~ m/Half-time<\/dt>\s*(\d+) - (\d+)<\/dd>/) {
    $match{home}{halftimegoals} = $1;
    $match{away}{halftimegoals} = $2;
}
You might encounter Unicode escaping problems when extracting some team or player names. To avoid them, make sure your program handles UTF-8. Declaring that the source code itself is UTF-8 is done at the beginning of the program:

use utf8;

Note that use utf8; only covers the source code; when reading the pages, you may also need to decode them as UTF-8 (read_file() from File::Slurp accepts a binmode => ':utf8' option for this).

After we’ve filled our hash with all the data, we have to store it in the file system. We can store it in multiple ways, with Storable (binary), Data::Dumper (Perl data structure strings) or even JSON::PP (JSON objects).
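For example, with Storable, saving and reloading the hash is just two calls (the file name is an assumption):

use Storable qw(store retrieve);

# Save the match hash as a binary file, then load it back
store(\%match, 'match.storable');
my $restoredMatch = retrieve('match.storable');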

The simplest method is Data::Dumper: give the routine our hash and it produces a string of Perl code representing it. However, we should set a couple of Data::Dumper package variables before using it:

$Data::Dumper::Purity = 1;
$Data::Dumper::Terse = 1;
my $dataStructureOutputText = Data::Dumper->Dump([\%match],["match"]);

Even though Data::Dumper is very easy to use, Perl data structure strings cannot be used directly from other languages. With JSON::PP, we can get the corresponding JSON string of a Perl data structure like this:

my $jsonString = encode_json(\%match);

We can also pretty-print the JSON string by using the object-oriented interface of JSON::PP:

my $json = JSON::PP->new;
my $prettyPrintedMatch = $json->pretty->encode(\%match);

To finish, we use write_file() from File::Slurp to save the data structure to a file:

write_file($matchDatafilePath, $dataStructureOutputText);

Verify your database

To reopen files we wrote with Data::Dumper, we simply use read_file() from File::Slurp and evaluate the string:

my $matchText = read_file($matchFilePath);
my %match = %{ eval $matchText };

If you chose JSON, the previous code becomes:

my $matchText = read_file($matchFilePath);
my %match = %{decode_json($matchText)};

Even though you could start analyzing the data now, don’t run queries on your database before verifying it. It often contains errors you did not expect, and they may not come from your side.

Here are some problems I encountered:

  • The website had a bug with the pagination system. The first match of the last page was not shown consistently. Sometimes it appeared, sometimes not.
  • The website showed duplicates in the last two pages.
  • Some webpages of the same season and of the same championship did not have the same amount of data (missing lineup, missing goal minutes, …).
  • The website used abbreviations inconsistently, so I had to establish correspondences between abbreviations and actual full names.

However, there was an even worse problem: some statistics were wrong. I found this by comparing the means of some stats to the values displayed on other websites. To fix this type of inaccuracy, I added a fifth step.

Suggestion: merge your database with another one

To make sure your database is correct, you could find another one and compare the results. You could take the best parts of each side and merge them to form a correct set of matches. In my case, one data set did not feature correct shot stats but provided lineups/cards/substitutions, while the other one had some incorrect dates but provided correct shot stats.

To establish the correspondence between the matches, we can use the names of the teams and the date of the match. Two matches between the same teams on the same day (almost) never happen. We use grep() to find the pairs:

my @correspondingMatches = grep {
    $secondSetMatch{date} eq $_->{date}    # dates are strings, so compare with eq
    and $secondSetMatch{home}{name} eq $_->{home}{name}
    and $secondSetMatch{away}{name} eq $_->{away}{name}
} @firstSetMatches;

Force the team names of each set to be consistent before merging, to avoid problems when forming pairs. Also think about rare cases like cancelled matches awarded 3-0 by forfeit.
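A minimal sketch of such a normalization, where the mapping table is an assumption you would build by hand for your own data sets:

# Hypothetical mapping from one data set's spellings to canonical names
my %canonicalName = (
    'Man Utd'        => 'Manchester United',
    'Manchester Utd' => 'Manchester United',
    'Arsenal FC'     => 'Arsenal',
);

sub normalize_team_name {
    my ($name) = @_;
    return $canonicalName{$name} // $name;    # fall back to the original name
}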

After having done all these steps (including compression), you are ready for the most interesting one: analysis.

Analyze your database

The first step here is to write routines that identify the type of data you’re manipulating. Establish match categories to separate the games with full stats (cards, possession, …) from the games with only a scoreline. If you don’t do this, you might get unpleasant surprises when running queries. For example, you might not notice you are trying to extract the minute of a red card from a file containing only a score line.
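As a sketch, such a routine could categorize a match hash by which keys it contains (the key names here are assumptions about how the hash was filled):

# Hypothetical categorization based on which fields were extracted
sub match_category {
    my ($match) = @_;
    return 'full'       if exists $match->{cards} and exists $match->{possession};
    return 'score_only' if exists $match->{home} and exists $match->{home}{goals};
    return 'unknown';
}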

With this last verification step, you should be able to analyze your database “safely”.

Have fun!