(Unofficial) Eurovision: Open Up


Introduction

It all began in 1956 as a way to help shore up post-war Europe, with only seven countries and one cameraman!

The Eurovision Song Contest is a long-running, live-broadcast, televised multi-national competition with a collaborative mission, not dissimilar in spirit to the Olympics. The contest has grown significantly from that modest start of 7 countries (and one cameraman) to over 40 countries competing these days; Australia even takes part now, through a specially arranged invitation. It's an annual celebration of European culture and the highlight of many people's year.

At Eurovision there is no division because, wherever you come from, Eurovision is home. The Eurovision Song Contest is widely known as a safe space for LGBTQIA+ people and a platform for free expression. For example, Dana International, a trans woman, won as far back as 1998. There have been songs in many different languages over the years, although most are in English these days. This doesn't matter, however, because music is a language we all know how to speak.

In its latest incarnation, after all the performances are over, the artists wait nervously while the show's hosts visit each of the 40+ countries in turn via live television link-ups, collecting the points cast by each country's appointed jury. These include the all-important top score of 12 points (douze points!), a double increment up from the 10 points awarded to the song a country ranks second, followed by 8, 7, 6 … 1 points for the songs ranked below that. With over 20 countries competing in a final, this means that not every performer gets points from a given country. Next comes the "popular vote", where the votes fans cast by phone, SMS or the Eurovision app are tallied, still grouped by country, and mapped into the same format of 12 points for first place, and so on. This all culminates in a new winner being crowned, with the competition typically being hosted the following year in the winner's country.

Features of this Website

This (unofficial) website has been developed by a small team of dedicated Digital Library researchers who also happen to be huge fans of Eurovision. We wish to share our love for the competition, and at the same time demonstrate what is possible when, harnessing some of that passion, the techniques of Linked Open Data are applied to the Open Source Greenstone3 Digital Library platform. For the technically interested, see the section It All Started with a Little SparkleSPARQL below for details about how the digital library was formed.

For those who want to jump right in and access information about, as well as see and hear some of the past performances, we suggest you start by exploring the assembled information through the browsing tabs. For example:

  • Browse by countries if you want (for instance) to reminisce about songs your country has entered in the past; or
  • Browse by years if you are curious about which countries competed in that inaugural year of 1956.

Alternatively, use the quick-search box to query the DL collection for a term that sparks your interest. For example:

  • love and amore, or maybe something more frivolous such as la.

Or how about picking a song at random:

Data Analysis and Visualization

All the metadata in the digital library is simultaneously published as linked data, meaning it is possible to extract and analyze the data contained here in a variety of ways. To aid in such analysis we have added in a data visualization layer to the digital library. This is how the bar-graph above has been created, which shows how many times each country has competed, alphabetically sorted.

Through our visualization page we provide samples you can try out to give you an idea of the sorts of visualization that can be produced. More importantly, these samples are editable, so you are free to change them however you wish. On the visualization page you'll find a sample that shows you how often different countries have won Eurovision, but perhaps you'd like to find out who has lost the most often? We also provide a sample dataflow visualization of jury voting patterns over the last decade, which makes for interesting viewing! Adjust the values used to discover how this compares with other time periods.

In addition to the visualizer, through the SPARQL-based data analysis page you will find a set of samples you can test-drive to give you an idea of the sorts of raw data analysis that can be done. The syntax used is called SPARQL (pronounced "sparkle"). If you are unfamiliar with this syntax, there are a variety of tutorials available online where you can learn about the query language, such as the one done by Apache Jena, an Open Source initiative that provides a variety of Semantic Web and Linked Data tools. As before, these samples are editable, so you are free to change them however you wish to adjust the analysis undertaken or, once you've mastered the query syntax, develop completely original forms of analysis.

We suggest starting by viewing the sample visualizations to see what's possible, and making minor edits to them to adjust what is visualized. Then, if you want to start visualizing the data in a substantially different way, or else export the data for more detailed analysis under your own control, switch to the SPARQL-based data analysis page to ensure the underlying data retrieved is as you intended. Then take the newly developed SPARQL query back to the visualizer page and, through the additional text-input fields provided there, develop the visualization.

It All Started with a Little SparkleSPARQL

In terms of how this collection was developed using the Greenstone3 Digital Library (DL) architecture, we are being a touch irreverent to say it all started with a little SPARQL. It is certainly true to say that, operationally, the DL was created using a SPARQL query that draws down JSON records from DBpedia about all the different entrants in Eurovision. These are then ingested into Greenstone using its document- and metadata-processing pipeline: expand through the show more ... button below to see the actual query. But in truth, our starting point of the SPARQL query is only possible due to the Herculean efforts of the contributors to the Wikipedia pages about the Eurovision Song Contest, and following on from that the endeavors of the DBpedia project to transform a substantial portion of that information into machine-readable linked data.

Continuing the technical development of the DL, we then added voting metadata to the DBpedia-extracted content, again using the Greenstone document- and metadata-processing pipeline, this time in the form of a CSV-based spreadsheet derived from the Kaggle Eurovision Voting dataset 1975-2019.

Here's the SPARQL query that retrieves, for every year Eurovision has been held, the countries that took part. At under 20 lines of code, we think it's pretty awesome! The information retrieved includes the country, year, title of the song, and name of the entrant (the act/artist), amongst other things: all useful core information to seed the digital library collection. As the 2020 Eurovision event did not run due to the COVID-19 pandemic, and (at the time of writing) the 2021 event is yet to occur, we have opted to filter the matches returned to those prior to 2020.

SELECT ?countries_in_esc_by_year ?country_in_year (?year AS ?Year) (?country AS ?Country) ?entrant (?entrant_label AS ?Creator) ?song (?song_label AS ?Title) (?was_derived_from AS ?WikipediaURL)
WHERE {
    ?countries_in_esc_by_year skos:broader dbc:Countries_in_the_Eurovision_Song_Contest_by_year.

    ?country_in_year dct:subject ?countries_in_esc_by_year.
    ?country_in_year dbp:year ?year.
    FILTER ( xsd:integer(?year) < 2020).

    ?country_in_year dbp:country ?country.

    ?country_in_year dbp:entrant ?entrant.
    ?entrant rdfs:label ?entrant_label
      FILTER (lang(?entrant_label) = 'en').

    ?country_in_year dbp:song ?song.
    ?song rdfs:label ?song_label
      FILTER (lang(?song_label) = 'en').

    OPTIONAL {
      ?song prov:wasDerivedFrom ?was_derived_from
    }
}
ORDER BY DESC(?countries_in_esc_by_year)
	    

You can try this query out yourself if you like. Select the entirety of the SPARQL query in the above text box, and press Control-C to copy it to your clipboard. Next, visit the DBpedia SPARQL Endpoint given below and, in the main text box of the page that appears, press Control-V to paste in your SPARQL query. Finally, click on the Execute Query button to initiate the search.
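For readers who would rather script this step than copy and paste, the following is a minimal sketch of how a request URL for a SPARQL endpoint could be assembled in Python. It makes some assumptions: the endpoint address used is the public DBpedia one (check it against the link given on this page), the "format" parameter is an assumption about how that endpoint selects its output, and the helper name build_query_url and the tiny stand-in query are ours, purely for illustration.

```python
from urllib.parse import urlencode

# Assumption: the public DBpedia endpoint address; verify against the link on this page.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_query_url(sparql, result_format="application/sparql-results+json"):
    """Return a GET URL asking the endpoint to run the given SPARQL query."""
    # "query" is the parameter name defined by the SPARQL 1.1 Protocol;
    # "format" is an assumption about how this endpoint chooses the output type.
    params = {"query": sparql, "format": result_format}
    return DBPEDIA_ENDPOINT + "?" + urlencode(params)


# A deliberately tiny stand-in query, not the full one shown above.
SAMPLE_QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"
url = build_query_url(SAMPLE_QUERY)
```

The resulting URL can be opened in a browser or fetched with urllib.request.urlopen; substituting the full query from above for SAMPLE_QUERY should work the same way, provided the endpoint is reachable.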

Through the SPARQL Endpoint you can change the output format that is used to, for example, JSON or Turtle. For convenience, if you are just interested in seeing what the outcome of running the query is, displayed as a web page:

Triplestore Errata

The above SPARQL query is a good starting point to extract all the Eurovision entries over the years; however, a more careful study of the returned results revealed a few complications that needed to be addressed. One issue stems from the fact that in the contest's inaugural year, countries were allowed to send two entries each. For 1956, for every URI representing a country in that year there are two titles and two entrants represented. As initially expressed, the SPARQL query does not cater for this circumstance and results in 2 x 2 = 4 combinations of artist and title for each country.

The way to address this is to include an additional constraint that ensures that the URI representing ?song includes the relationship dbp:artist for ?entrant, effectively locking in to the artist that performed that particular song. Studying the result of this change, however, showed up a more wide-reaching problem, which was that not all the ?country_year URI entries expressed relationships to songs and artists that were themselves URIs: sometimes they were represented as a string literal, meaning the added constraint would fail and reject entirely the details about a country's entry in that year. Compounding this, we also saw that some of the processing work by DBpedia to turn the manually curated information in Wikipedia into machine-readable form erroneously handled the formation of some of the song titles and artists.
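To make the two effects just described concrete, here is a small Python sketch (not the actual SPARQL, and using made-up placeholder values rather than real DBpedia data). It first shows how matching entrants and songs independently yields every pairing, and then how a song-to-artist constraint both removes the spurious pairings and silently drops a song that has no such relationship, as happens when the song is stored as a plain string.

```python
from itertools import product

# Hypothetical placeholder values standing in for one country's 1956 entries.
entrants = ["Artist A", "Artist B"]
songs = ["Song A", "Song B"]

# Matching ?entrant and ?song independently behaves like a cross join.
unconstrained = list(product(entrants, songs))

# Stand-in for the song -> dbp:artist relationship. Only songs that are
# resources (URIs) can carry it; "Song B" plays the part of a song that is
# stored as a string literal and so has no artist relationship at all.
artist_of = {"Song A": "Artist A"}

# Keep only the pairings where the song really names that entrant as its artist.
constrained = [(e, s) for e, s in unconstrained if artist_of.get(s) == e]
```

Here unconstrained holds all four pairings, while constrained keeps only the genuine Artist A / Song A pairing. Song B disappears altogether, which mirrors the loss of entries described above.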

The fact that the erroneous entries were strings (even integer numbers at times!) and not URIs gave us a way in to see how widespread the problem was. Using adapted versions of the main SPARQL query we had formulated, we were able to produce lists of the affected entries. The lists are available here through the following links:

The generation of these lists also provided the key to the approach we used to compensate for the complications these issues introduced. Skipping ahead slightly to the formation of the Digital Library collection with Greenstone3, we make use of this software architecture's Triplestore Extension, which means that in addition to the main DL and Open Archives Initiative (OAI) server endpoints, there is also a triplestore backend. While the triplestore extension was designed to provide SPARQL access to the metadata and document content of the DL collections, its existence means we can include in it a graph that represents the errata information we need to "course correct" the SPARQL query so that it performs as intended.

This does admittedly complicate the expression of the query, but the additions are manageable. The expanded query makes use of SPARQL's federated search feature: the query starts as before with the retrieval of triples from the DBpedia endpoint; based on the resolved values of entities such as ?country_year and ?song, it then optionally retrieves matching items from the DL SPARQL endpoint. The final step is to use a conditional clause (if-statement) to test whether the DBpedia version of the song is a literal; if it is, and there is a bound value for the one retrieved from the DL, then the DL value is selected in preference.
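The selection rule in that final step can be sketched in a few lines of Python. This is only an illustration of the logic, not the query itself; the function name and the example values are hypothetical. In SPARQL the same test would most likely be written with the standard IF, isLiteral and BOUND functions, along the lines of IF(isLiteral(?song) && BOUND(?dl_song), ?dl_song, ?song), where ?dl_song is an assumed variable name for the value retrieved from the DL endpoint.

```python
from typing import Optional


def preferred_value(dbpedia_value: str, dbpedia_is_literal: bool,
                    dl_value: Optional[str] = None) -> str:
    """Pick the DL errata value only when DBpedia gave a literal and the DL has a match."""
    if dbpedia_is_literal and dl_value is not None:
        return dl_value
    return dbpedia_value


# Hypothetical values for illustration only.
fixed = preferred_value("Song A (garbled)", True, "http://example.org/song-a")
kept_uri = preferred_value("http://example.org/song-b", False, "http://example.org/other")
kept_literal = preferred_value("Song C", True, None)
```

The three calls cover the cases that matter: a literal with an errata match is replaced, a proper URI is always kept, and a literal with no errata match is left as it was.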

The DBpedia SPARQL endpoint doesn't allow for federated queries, and so we initiate the SPARQL queries through the DL's SPARQL endpoint, using SERVICE blocks to specify the parts of the query that are run on the DBpedia endpoint.

Adding in Voting Metadata

To fulfill our vision of developing this DL collection as a rich resource through which people can explore the phenomenon, we went looking for voting data that was available in a machine-readable format. We found that data about how countries have voted going back to 1975, compiled through a manual curation process, is available through the Kaggle website as an Excel spreadsheet.

To incorporate this as metadata into the DL, we wrote some Python code to transform the data into the internal serialized metadata format used by Greenstone. Prior to this project, the only serialized form for this was XML, which is processed by the MetadataXML plugin. As it was more convenient to generate JSON from our Python code, we took the step of adding in a new plugin to Greenstone3: MetadataJSON.
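As a rough idea of what such a transformation can look like, the sketch below reads a few rows of voting data and groups the points received by each country in a year into one JSON record. Everything specific in it is an assumption made for illustration: the column names, the sample values, the function name votes_to_metadata, and above all the layout of the JSON records, which is not the actual format read by the MetadataJSON plugin.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical column names and values; the real spreadsheet's headers may differ.
SAMPLE_CSV = """year,from_country,to_country,points
1975,Country A,Country B,12
1975,Country C,Country B,10
1975,Country A,Country C,8
"""


def votes_to_metadata(csv_text):
    """Group the points received by each (year, country) into one record."""
    received = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (int(row["year"]), row["to_country"])
        received[key][row["from_country"]] = int(row["points"])
    # Illustrative record layout only, not Greenstone's actual metadata schema.
    return [
        {"Year": year, "Country": country,
         "PointsFrom": points, "TotalPoints": sum(points.values())}
        for (year, country), points in sorted(received.items())
    ]


records = votes_to_metadata(SAMPLE_CSV)
metadata_json = json.dumps(records, indent=2)
```

Grouping by the receiving country keeps each record aligned with a single entry in the collection, which is one plausible way to attach votes to the documents already drawn from DBpedia.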

Page Scraping

Despite our best intentions to work solely with machine-readable data to form the Eurovision DL (primarily, as you have seen, in the form of Linked Open Data, but also utilizing a spreadsheet of voting data), in looking to expand the metadata in the DL to cover details concerning the draw position of acts and their overall placing, we have resorted to page-scraping content from Wikipedia itself. This was because such information was not part of the entity extraction process that occurs when Wikipedia is mapped to DBpedia.

A review of Wikipedia article pages about the event in any given year showed these pages to be especially well curated, and each included a table that listed the information we sought. While there was some variation in how this table was expressed in HTML, with a considerable portion of the heavy lifting being done by the Python library BeautifulSoup4, it was not too complex a task to develop a program that extracted this information and turned it into the newly developed Greenstone JSON metadata format.
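The general shape of such a table extraction is sketched below. Note that this is not our BeautifulSoup4 program: so that the example runs without any third-party packages, it swaps in Python's standard html.parser module instead, and the sample table, its column headings and the class name TableRows are all invented for illustration. Real Wikipedia tables vary more than this and would need extra handling.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up table for illustration; not copied from any Wikipedia page.
SAMPLE_HTML = """
<table>
  <tr><th>Draw</th><th>Country</th><th>Artist</th><th>Song</th><th>Place</th></tr>
  <tr><td>1</td><td>Country A</td><td>Artist A</td><td>Song A</td><td>2</td></tr>
  <tr><td>2</td><td>Country B</td><td>Artist B</td><td>Song B</td><td>1</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
header, *body = parser.rows
entries = [dict(zip(header, row)) for row in body]
```

Each item in entries is a dictionary keyed by the column headings, which is a convenient intermediate form before writing the values out as metadata.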

Patching in Missing Data

Another difficulty we have encountered is that not every country that had an entry in Eurovision in a given year has its own standalone article page. This leads to missing entries in the category page for the contest in a given year, which is problematic for us, because it is this category information that we draw upon in our SPARQL query to populate the DL with all the acts.

The information about all the countries competing in a given year does, however, appear in the article page for the contest in that year. In fact it's in the same table we targeted to extract out the draw position and placement. We therefore wrote a further page-scraping program to compare the countries in that table with the countries listed on the category page for the contest in that year. For any entries we find in the table, but not in the Category page, we produce a metadata record for the DL with basic information about the entry: country, year, song title, artist, draw-position, placement, and (where available) their total score.
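The comparison at the heart of this step amounts to a set difference, as the following sketch shows. The data, field names and the function name find_missing are hypothetical, and the case-insensitive matching of country names is our assumption about what a robust comparison would need rather than a description of the actual program.

```python
def find_missing(table_entries, category_countries):
    """Return the table entries whose country is absent from the category listing."""
    listed = {name.strip().casefold() for name in category_countries}
    return [entry for entry in table_entries
            if entry["Country"].strip().casefold() not in listed]


# Made-up data: three entries in the contest table, two countries on the category page.
table_entries = [
    {"Country": "Country A", "Year": 1956, "Title": "Song A", "Artist": "Artist A"},
    {"Country": "Country B", "Year": 1956, "Title": "Song B", "Artist": "Artist B"},
    {"Country": "Country C", "Year": 1956, "Title": "Song C", "Artist": "Artist C"},
]
category_countries = ["Country A", "country c "]

missing = find_missing(table_entries, category_countries)
```

Only the Country B entry is reported as missing here, and it is entries like this one for which a basic metadata record is then produced.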

As with the problem titles and artists/entrants, we have formulated a SPARQL query that enumerates these missing category entrants:

This all culminates in ...

All this work is put together and run through the ./PREPARE.sh script. The outcome of this is a set of source documents that can be imported into the digital library collection. It is instructive to run through the various steps of the prepare script and learn how it operates. For TLDR users out there, we have packaged up the output from running the prepare script as: build

The Gory Details

Viewing the collection configuration file provides a good insight into how all of these technical aspects are brought together.

Full disclosure as to how the collection all ticks is provided through our Subversion repository. Topping up our Greenstone3 code base we have: