Tag Archives: Google Docs

01 Jul

5 ways to gather data


 Caroline Beavon is a freelance information and infographics designer – get in touch for more details



Everyone is talking about data journalism nowadays: creating maps, visualizations and infographics. However, before you can do any of that you need some DATA.

Here is how I sourced the data for my Datamud project, a look at the statistics behind the big UK music festivals.


1. SEARCH

Official Site

The last thing you want to do is call up a press officer asking for some stats, when they are there, for all to see, on the website. Dig around in any areas labelled information, statistics, FOI and Press Area. Companies will often post useful statistics if they are frequently requested, but they won’t necessarily make those statistics easy to find.

The Glastonbury Festival Educational Resources area is rich with information. A series of PDFs contains details about every element of the event – crowd management, security, stalls, sanitation etc. As the UK’s largest festival, it is often the subject of assignments and reports. This was useful as I looked for recycling information to back up the organisers’ claims that they are a green event.

Google

Google is a wonderful tool – it not only searches websites, but also blogs, news postings, pictures and videos. It’s well worth checking the NEWS section as someone else may have already done similar research and posted the stats online.

Unfortunately a search can return thousands of pages, so you need to be smart when submitting your search. Putting inverted commas around a phrase searches for those exact words as written, and combining a quoted phrase with simple search terms can be a useful way to narrow the results.

e.g. “were arrested” 2010

Don’t forget to check the later pages of the search too – sometimes you will find some juicy stuff buried on the less Google juicy sites.

Governing Bodies

Often Google won’t be able to pick up deep linked pages, or documents embedded or linked in pages, so it’s always worth looking at official agencies’ and governing bodies’ websites too.
Councils and the Government are now much better at archiving their agendas and minutes and whilst the search facilities are still pretty archaic and frustrating, it’s a start.

None of the various police forces’ websites had the crime stats that I needed, although they do often have documents that may be of use, e.g. Leicestershire Police.

Search / Scraping Sites

Although I did not use this during this assignment, in retrospect using a site like Scraperwiki to access data from an official site would have saved me a lot of time. I could have used it to draw together all the line ups, for example, instead of a long winded cut-and-paste process, and plenty of cleaning up.
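As a rough idea of what a scraper does with a line-up page, here is a minimal sketch using only Python's standard library. It is not ScraperWiki's own interface, and the HTML, the "act" class name and the band names are all made up for illustration; a real festival site will be laid out differently.

```python
from html.parser import HTMLParser

# Hypothetical markup: a real festival site will use different tags and classes.
SAMPLE_HTML = """
<ul class="lineup">
  <li class="act">Band A</li>
  <li class="act">Band B</li>
  <li class="sponsor">Not a band</li>
  <li class="act">Band C</li>
</ul>
"""


class LineupParser(HTMLParser):
    """Collects the text of every <li class="act"> element."""

    def __init__(self):
        super().__init__()
        self.in_act = False
        self.acts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "li" and ("class", "act") in attrs:
            self.in_act = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_act = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_act and text:
            self.acts.append(text)


def extract_acts(html):
    """Return the act names found in the given HTML string."""
    parser = LineupParser()
    parser.feed(html)
    return parser.acts


print(extract_acts(SAMPLE_HTML))
```

The list that comes out can go straight into a spreadsheet, which is the part that replaces the cutting, pasting and cleaning up.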

Nowadays there are also sites that have done a lot of the work for you, by monitoring official sites and databases and turning the data into an easy to handle format.

First stop should be What Do They Know, a site geared up around FOI requests (more on this in a moment), but you should also definitely visit TheyWorkForYou. I set up an alert for the Glastonbury festival, which would tell me whenever it was mentioned. My hope was that crime levels or crowd management would be raised at some point, and reference made to the information given.

Interest Sites

I mentioned Google News search above, but it’s also worth looking for sites that deal with the specific subject area. They may have useful resources but may not appear on page 1 of a Google Search.

When I was compiling lists of the bands playing the various festivals, often the official sites were clunky or the names were shown on a JPG of the official event poster. However, festival news/interest sites, such as EFestivals, present the information in a more useful way.

2. ASK PRESS OFFICE

For archive or very up to date statistics, often a call to the press office is necessary.

I wanted to find out more about historical weather forecasts, and a visit to the Met Office website informed me that they had a library of data that could be accessed. Within one quick email conversation I was furnished with a link to a host of archive weather data, with records often going back to the 1700s. In CSV format, these were simple to manipulate and visualise.
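To show why CSV is so easy to work with, here is a small sketch that totals a column by year using Python's built-in csv module. The column names and every number in it are invented placeholders, not Met Office records, and the real files may be structured differently.

```python
import csv
import io

# Placeholder data: made-up values, NOT real Met Office records.
SAMPLE_CSV = """year,month,rainfall_mm
1990,6,40.0
1990,7,20.0
1991,6,60.0
1991,7,30.0
"""


def total_by_year(csv_text, value_column):
    """Add up one numeric column for each year in the CSV text."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = int(row["year"])
        totals[year] = totals.get(year, 0.0) + float(row[value_column])
    return totals


print(total_by_year(SAMPLE_CSV, "rainfall_mm"))
```

With a downloaded file, the same function would work on the file's contents read in as text.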

Press offices are used to dealing with requests for information (it’s their job) and are usually happy to help you meet deadlines.

3. FOI

FOI requests are for those tricky bits of data that an organisation is more reluctant to send out (because of time, size, sensitivity etc.). I sent ONE FOI request, for crime stats to a police force, foolishly thinking this would be quicker than contacting the press office directly. It was not.

Use these if you do not need the information urgently (it can take up to a month from start to finish).

Interesting article on FOI Requests from Channel 4

4. CROWDSOURCE

Of course carrying out your own research is one way of gathering data, but this project relied on the theory that “many hands make light work”.

I wanted to find out how much it would cost to see the various mainstage bands, if you were to see them on their own headline tours. I could have spent DAYS trawling the internet ticketing sites (both UK and international) collecting the data. Instead I started a public Google Docs spreadsheet. Through the social networks I encouraged people to enter the prices of tickets they had recently bought. The database was soon a third full, and a chance message from an old friend (the man behind Ents24) completed the rest by gaining access to their database.

Google Docs is a fantastic way of collaborating and getting large jobs completed.

5. I GOT MY CALCULATOR OUT

This can be hard work if you are dealing with a lot of data, but for me it was feasible.

I wanted to assess the nationalities of the various bands, and compare the overall nationalities of the different lineups. This involved a lot of searches on Myspace and Wikipedia (still both very useful resources for the facts about bands) and using the visualisation software Tableau.

In retrospect I should have doubled this database up with the ticket prices one, and asked people to fill in the nationalities of the bands as well. Hindsight is a wonderful thing.

 

Want more? – DATA JOURNALISM: MORE THAN NUMBERS AND CHARTS

 

 




08 Apr

MA Online Journalism: Multimedia Journalism Breadth Portfolio

Journalism – (noun) The occupation of reporting, writing, editing, photographing, or broadcasting news or of conducting any news organization as a business

This traditional definition of journalism (Dictionary.com http://bit.ly/dCdXBa), despite massive leaps forward in technology and attitude, still sums up exactly what the profession is about today: in short, getting the news out there.

Unfortunately, some are reluctant to accept these changes to the industry: old school hacks refusing to interact with readers online, newspapers not utilising video, radio stations limiting themselves to audio broadcast, whilst, behind them, there is an army of citizen reporters armed with iPhones, Youtube, Flickr, Audioboo and Bambuser ready to step in and take over the gatekeeping of the world’s news.

These are exciting times, and at the start of this educational trip into multimedia journalism, I expected to focus on video, with a brief (and required) nod towards audio and Flash.

Little did I know.

The toughest challenge from the outset was finding inspiration for projects. As a working multimedia journalist, that decision would be handled by the News Editor, who would give you a brief and a deadline.

So, I decided to play the role of a working multimedia journalist. Switching on Sky News, I took the first story that interested me, and ran with it.

FLASH

JONNY DOREY (link to page)

Jonny Dorey is a British student, currently missing whilst studying in the USA. At this point the story lacked data (i.e. dates and times), so a simple roll-over Flash animation showing the various elements of the story seemed the best starting point with this medium.

Ideally, with more knowledge and artistic skill, the story would have benefited from something a little more intricate (along the lines of the BBC visualization of the Jean-Charles de Menezes shooting in London). This visualization is outstanding, with the image zooming in at each stage, and moving markers to show the relevant parties. However, the level of detail here was high due to the evidence revealed in court. At the time the information regarding Jonny Dorey was scant – although since then there have been suspected sightings of him – which would have worked well on a map based animation, as well as his possible route taken. Youtube appeals, photographs and other multimedia content could then be embedded into this map. A multimedia tool like this may have been useful in spreading the word about Jonny’s disappearance, and getting people involved in the search.

FESTIVAL MAP (link)

The Jonny Dorey project broke down the story and made it easier to digest, but Flash is also a useful tool for solving problems and aiding decision making.

Over the last few years the UK has become the centre of music festivals, with hundreds happening every summer. There are also dozens of websites that claim to centralise all this information (lineups, festival dates etc.), but none of them have managed it in a clear and visual way.

A Venn diagram would have worked well in showing the overlap between different bands playing the larger festivals but, as yet, I have been unable to find a visualization tool that will achieve this. In retrospect, a clickable map showing which bands are playing where and when was a lot more effective.

Featuring the 6 big festivals, and just the stage headliners (a manageable number, in order to get the project completed for publication), the map allows the user to click on a band’s name listed alongside, and points would flash on the UK map, with the festival name and date of appearance.

A second tier to this festival map would have been useful, where the user could click on the festival point on the map and be shown all the bands playing. Unfortunately the map was too crowded with “hotspots” and became unusable.

However, this information is constantly being updated and this does bring up the issues of maintaining and updating Flash sites. Would it be easy to ADD to the map, or would it make more sense to make a data map instead, with the information automatically pulled in from a feed?

The Maps Channels Events site handles events on a map excellently (even though the interface is a little basic and ugly). You can search for a date, artist or venue, and it shows the location on a Google Map.

There is definitely scope to explore something like this, as festival websites are big business, and I could see one of the key sites, or music magazines, taking up this idea.

DATA

GLASTONBURY (link)

Glastonbury Festival is famously the UK’s largest music event, and I was keen to investigate how it has grown over the years, along with the price to attend (data acquired from the official Glastonbury site).

Using Google Docs Spreadsheet and the ManyEyes visualization tool I created a scatter chart. ManyEyes has the limitations of not linking to live data, so if statistics change the data has to be re-pasted into the site but the choice of graphs and the interface made this a perfect tool for this project, and others.

I expected the chart to show a gradual increase both in capacity and ticket price, but it did flag up a drop in capacity after they took a break in 2006. It is this kind of anomaly that would work well illustrated in a timeline/chart mash-up – with landmarks in the festival’s history (license issues, poor ticket sales, bad weather) – something akin to The Times Eating Chart, where the user rolls over the years, and sees the various developments.

There will be more on the issue of flawed data later in this document, but this chart does raise the issue of how money is shown in charts over time, given inflation. How does an £8 ticket in 1981 actually compare to a £185 ticket today? Does it make a mockery of the statistics if the price is not converted into a standard “worth”? This issue has been seen recently with claims that Avatar is the highest grossing movie of all time.
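One common way to put old and new prices on the same footing is to scale the old price by the ratio of a price index between the two dates. The sketch below shows only the arithmetic. The index values in it are placeholders chosen to make the sums easy, not real inflation figures; a real comparison would need an official price index series.

```python
def adjust_for_inflation(nominal_price, index_then, index_now):
    """Convert an old price into 'today's money' using a price index ratio."""
    return nominal_price * index_now / index_then


# PLACEHOLDER index values, purely illustrative. They are NOT real inflation data.
PRICE_INDEX = {"1981": 100.0, "today": 300.0}

ticket_1981 = 8.0
adjusted = adjust_for_inflation(
    ticket_1981, PRICE_INDEX["1981"], PRICE_INDEX["today"]
)
print(f"£{ticket_1981:.2f} then is £{adjusted:.2f} now (with the placeholder index)")
```

Plotting the adjusted prices rather than the face values would show how much of the rise is the ticket getting dearer and how much is simply money being worth less.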

More interesting was the comparison between the official capacity of the festival and the actual number of people attending. Glastonbury has had a long running battle with gatecrashers (or fence-hoppers) and as a news story this is an interesting set of data.

A bar chart suited this project, with the 2 capacity figures alongside each other, and showed just how dramatic the problem of “fence hopping” has been for the festival.

Unfortunately, actual capacity stats are hard to come by (as they are tricky to monitor) so “guestimated” figures were found in news reports (e.g. BBC, newspapers) and blogs, although I accept these figures are largely speculative and may be inaccurate. An FOI request has gone into Avon and Somerset Police, who should have some official estimated attendance figures.

Using estimated and reported data for a project like this also comes with a moral responsibility. Despite recent successful measures to prevent gatecrashers, according to some reports thousands of people are still getting into the site without paying. There is constant scrutiny of the management of the festival and I did feel uncomfortable publishing speculative figures that could be taken out of context by critics (including the local council who approve the license for the event).

However, there was definitely room here to investigate any correlation between the price of the ticket and the numbers of people trying to get in for free – are people driven to jump the fence as the price goes up?

Unfortunately I simply did not have enough data to hand (9 years’ worth of unofficial capacity stats) to make this work effectively, and will retry it if my Freedom of Information Act application to Avon and Somerset Police is successful.
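Once there is enough data, one simple way to test for a link is a Pearson correlation coefficient between ticket price and estimated gatecrasher numbers. This is a sketch of that calculation, not something used in the project, and the numbers are placeholders rather than Glastonbury figures. A value near 1 or -1 suggests a strong straight-line relationship, but with only a handful of years it proves very little, and it says nothing about cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, with 2+ values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("one of the lists never changes, so no correlation")
    return covariance / sqrt(spread_x * spread_y)


# PLACEHOLDER numbers, not real ticket prices or attendance estimates.
prices = [10.0, 20.0, 30.0, 40.0]
gatecrashers = [1000.0, 2000.0, 3000.0, 4000.0]

print(round(pearson(prices, gatecrashers), 3))
```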

ITUNES LIBRARY

As a more personal project, and to test some other charts on ManyEyes, I decided to make use of the data from my ITunes player.

By cutting and pasting the relevant columns (“artist”, “song”, “genre” and “plays”) into a spreadsheet, and using the ManyEyes Bubble Chart visualization, there was an instant display of the most played genres.
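What the bubble chart is doing underneath is adding up plays for each genre. A minimal sketch of that sum, using invented placeholder rows rather than my actual library, looks like this:

```python
from collections import defaultdict

# Placeholder rows, not the real library.
ROWS = [
    {"artist": "Artist A", "song": "Song 1", "genre": "Alternative", "plays": 12},
    {"artist": "Artist A", "song": "Song 2", "genre": "Alternative", "plays": 8},
    {"artist": "Artist B", "song": "Song 3", "genre": "Metal", "plays": 5},
    {"artist": "Artist C", "song": "Song 4", "genre": "Pop", "plays": 7},
]


def plays_by(rows, key):
    """Total the play counts for each value of `key`, biggest first."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["plays"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


print(plays_by(ROWS, "genre"))
print(plays_by(ROWS, "artist"))
```

Grouping by "artist" instead of "genre" is a one-word change, which makes it quick to check whether one heavily played artist is inflating a category.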

“Alternative” was the largest category – whereas most of the music I listen to would fall under rock, electronic or industrial.

Tweaking the data, switching genre for artist showed that it was a classification issue, not musical taste, which had completely distorted the data. Celldweller, an industrial artist, had been categorized as alternative. I spotted the problem as I know the subject, but what about data from an external source?

How can we trust that the classification of data is always correct? Even the rawest of data has still been analyzed and gone through a personal “opinion” filter. There have been examples of crime stats being skewed by personal opinion (whether it’s at face value, from the PC attending the call, or the data builder designing the charts) or even by simple geographic boundaries.

IAN HUNTLEY ATTACK

The recent attack on Soham killer Ian Huntley earned some interesting reaction online. With emotions running so high, it seems the public are still happy to see the man come to harm.

Using a Google spreadsheet and the command =importfeed("http://search.twitter.com/search.atom?q=huntley", "", "", 20), I searched Twitter for all the tweets mentioning “Huntley” (as opposed to “Ian Huntley”, which would have limited the search to the more formal tweets from news outlets etc.; “Huntley” picked up the casual, public point of view).

This created a spreadsheet of the latest 15 tweets containing the word Huntley, which were then copied into Wordle in order to create a word cloud. This was not a particularly useful or interesting experiment, as it only highlights which words have been used the most – i.e. “Huntley” and “prison” – the more emotive words were used in smaller numbers so were not significant on the cloud.
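A word cloud is essentially a word count drawn large, which is why the search term itself dominates. The sketch below shows that counting step in Python. The three "tweets" are invented placeholders, not real tweets, and the stop-word list is a tiny illustrative one; this is not how Wordle itself is implemented.

```python
import re
from collections import Counter

# Invented placeholder text, NOT real tweets.
TWEETS = [
    "huntley prison news",
    "news about huntley",
    "huntley prison report",
]

# A tiny, illustrative stop-word list.
STOP_WORDS = frozenset({"about", "the", "a", "in", "on"})


def word_counts(texts, stop_words=frozenset()):
    """Count how often each word appears across all the texts."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts


counts = word_counts(TWEETS, STOP_WORDS)
print(counts.most_common(3))
```

Adding the search term to the stop words would stop it swamping everything else, although words used only once or twice would still come out small.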

Instead I decided to analyse how the story was being covered in 2 very different newspapers, The Guardian and the Daily Mail.

Over the past weeks I have been trying and testing several data visualization tools (Tableau, Gliffy, Graphviz) but have been taken with ManyEyes for its variety of charts, including analysis of TEXT.

Using the Word Tree visualization, I copied in the articles to analyse how the documents were structured, and which words followed HUNTLEY in the text. The Guardian’s report followed Huntley with “convicted”, “forced to fight for his life”, “held at knifepoint” and several basic words, whereas the Daily Mail’s article followed it with “was given privileges”, “supposed to be under constant surveillance” and “lured schoolgirls Holly and Jessica”. This text analysis is a useful tool for clearly seeing how the focus of a report is handled, especially, in this case, when the reports are written from 2 different points of view.
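The core of a word tree is simple: find every occurrence of a word and collect what comes straight after it. Here is a rough Python sketch of that step. It is not the ManyEyes implementation, and the sample text is stitched together from two of the fragments quoted above rather than taken from the full articles.

```python
import re


def following_phrases(text, keyword, width=3):
    """Return the `width` words that follow each occurrence of `keyword`."""
    words = re.findall(r"[\w']+", text)
    phrases = []
    for position, word in enumerate(words):
        if word.lower() == keyword.lower():
            following = words[position + 1 : position + 1 + width]
            if following:
                phrases.append(" ".join(following))
    return phrases


# Sample text built from fragments quoted in this post, not a real article.
SAMPLE_TEXT = (
    "Huntley was given privileges. "
    "Huntley supposed to be under constant surveillance."
)

print(following_phrases(SAMPLE_TEXT, "huntley"))
```

Running this over each newspaper's article separately gives two lists that can be compared side by side.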

AUDIO SLIDESHOW

Although initially reluctant to do any form of audio due to my radio background (and not wanting to stay within my field), I did decide to explore the world of audio slideshows.

There are several effective examples of this, and I was impressed by the ability to create emotion through slow moving images (e.g. Duckrabbits). However, I wasn’t personally interested in following the documentary style, instead looking into the possibility of enhancing something that would normally take a simple audio form – a music news bulletin.

With my background in radio I could quickly produce an audio bulletin, and spend the time learning about using images and transitions.

However, sourcing the images legally was of concern to me, and whilst using Creative Commons images on Flickr is an option, most of the pictures were taken at live shows from a distance, and were not suitable for this project.

Stock photograph websites do not carry celebrity shots and official press shots are hard to come by if their star is in the news for the wrong reasons.

Unfortunately it came back to a simple Google Image search and making use of the relevant pictures that it provided.

The images had to be relatively close-up, of good quality and should supplement the story. For example the image of Pete Doherty with the policeman and Damon Albarn with the cigarette were obvious choices, considering the subject matter.

As an editor, Windows Movie Maker offers a range of movement and transition options for the images. Movement over and between the pictures added to the story – for example, zooming in on the eyes of Robin Whitehead, the heiress and filmmaker found dead in a London flat. This gave the impression of sadness and tragedy. There was also humour in using pictures to highlight the fact that the lead singer of Killswitch Engage has the same name as 80s pop star Howard Jones.

This process took around an hour and a half in total, from writing the bulletin to having a finished, uploaded piece.

I would like to try to bring more humour into the report, along the lines of Rocketboom, otherwise this will simply be mimicking a TV-style 60 second news report, with images instead of video.

I would very much like to pursue this project on a regular basis (maybe even daily) but without access to good quality photographs legally, I do not believe it is possible.

18 Mar

Looks like I’m not into metal any more Toto

Data can be an interesting and eye opening thing.

I decided to cut and paste some sections of my iTunes library into Google Docs and create a data set from Artist, Track, Genre and Plays.

  1. sort tracks by PLAY COUNT
  2. remove TIME, BITRATE, DATE ADDED and TRACK NO columns
  3. scroll down to the bottom of the tracks with “2” plays
  4. select every song with 2+ plays
  5. CTRL+C
  6. open a blank spreadsheet (I use Google Docs) and CTRL+V into the top left corner of the page
  7. the Itunes data appear in the Spreadsheet
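
The same steps could also be done in code rather than by hand. This is only a sketch: it assumes the library has been saved as tab-separated text with the column names shown, which are my guess for illustration and may not match a real export, and the rows are placeholders.

```python
import csv
import io

# Placeholder export. Column names and rows are assumptions for illustration.
SAMPLE_EXPORT = (
    "Name\tArtist\tGenre\tTime\tPlays\n"
    "Song 1\tArtist A\tRock\t200\t5\n"
    "Song 2\tArtist B\tPop\t180\t1\n"
    "Song 3\tArtist C\tMetal\t240\t2\n"
    "Song 4\tArtist D\tPop\t210\t\n"
)


def tracks_with_min_plays(tsv_text, min_plays=2):
    """Keep Artist, Track, Genre and Plays for tracks played min_plays+ times."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        plays = int(row["Plays"] or 0)  # an empty cell counts as zero plays
        if plays >= min_plays:
            kept.append(
                {
                    "Artist": row["Artist"],
                    "Track": row["Name"],
                    "Genre": row["Genre"],
                    "Plays": plays,
                }
            )
    # Most played first, matching the "sort by PLAY COUNT" step
    return sorted(kept, key=lambda track: track["Plays"], reverse=True)


for track in tracks_with_min_plays(SAMPLE_EXPORT):
    print(track)
```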

Obviously this data is immediately out of date, so I am looking now into turning this into a live feed. As a PC user ITunes stats is not an option.

Points to Note

  • I often listen to Spotify instead of Itunes at home
  • I only listen to Itunes when I am working – this does not take into account Ipod plays, or CD listening in car
  • genre categorizations on Itunes can be questionable

So the first chart:

I’m not sure what I find more interesting – that metal is SUCH a tiny category (smaller than Country, worryingly) or that I seem to really like pop. I will investigate this further. Ok – a quick tweak to the options (colour to genre and label to ARTIST) showed that, phew, I’ve not turned into a pop-loving indie kid just yet. It’s just that someone thinks Celldweller (industrial drum n bass noise) is alternative (see for yourself). (See, mislabelling, very deceiving.)

NEXT STOP:

  • Find a way to make my Itunes data public, feed this into a live chart.
  • Create a flash animation using one of these charts, with shooty out bits that play music from that artist or genre …
  • Stop messing around with data for today and make some tea.

18 Mar

Glastonbury data mashup

NOTE: This is very much a work in progress, so any advice, feedback or tips, much appreciated! Also some of the data used is from news reports/blogs and hence is of a speculative nature but has been included for demonstrative purposes.

As part of my MA Online Journalism I have been playing around with some data from the Glastonbury festival archives.

I wanted to show the statistical history of the festival through a visual medium.

I started a spreadsheet in Google Docs and used the ManyEyes site to create my charts.

Michael Eavis officially took over the regular running of the festival in 1981 and this is where I began my research. Using official data from the Glastonbury website, I built a spreadsheet of the standard weekend camping ticket prices and official capacity (later finding this all laid out in table form on a license application PDF!).

I started by comparing ticket prices, over the years, with capacity.

Interestingly this shows a DROP in capacity between 2005 and 2007 (there was no Glastonbury in 2006).

However, this only shows the official capacity. Glastonbury festival has had a long running battle with gatecrashers (or fence-hoppers) and I felt it would be interesting to compare the actual capacity with the official one.

Unfortunately, actual capacity is hard to come by – I gathered some information from news reports and blogs, although I accept these figures are largely speculative and may be inaccurate.

(On a personal note I was also concerned that, despite recent successful measures to prevent gatecrashers, according to some reports thousands of people are still getting into the site without paying. I am aware that there is constant scrutiny of the management of the festival and I did feel uncomfortable publishing speculative figures that could be taken out of context by critics.)

I inputted the data into a scatter diagram, as above, but this did not clearly show the distinction between the 2 sets of data. I converted it into a simple bar chart which, in this case, is a lot more effective.

Although I still have some data to gather, it is interesting to see the sizeable spikes in 1995, 1999 and 2000, which led to the festival being called off the following year for a “rethink”.

Next, I decided to compare the three sets of data – price, official capacity and actual capacity – to see if there was a link between the numbers of people “fencehopping” and the price of the ticket. Instead of placing all 3 data sets on one chart, I decided to create a fourth column, showing the difference between official capacity and actual capacity.

The problem with this chart is currently the lack of data. I have plotted the years where I do not have estimated capacity, which is making the ones where I do seem dramatically out of sync. I will retry this chart once I have more data.
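One possible way round this is to leave the difference blank for years with no estimate, and only plot the years that have one. The sketch below shows the fourth column worked out that way. The figures are placeholders rather than Glastonbury data, and the field names are my own.

```python
# PLACEHOLDER figures, not real capacity or attendance numbers.
YEARS = [
    {"year": "Year 1", "official": 100000, "estimated": 130000},
    {"year": "Year 2", "official": 100000, "estimated": None},  # no estimate found
    {"year": "Year 3", "official": 110000, "estimated": 115000},
]


def add_excess(rows):
    """Add an 'excess' column: estimated attendance minus official capacity."""
    result = []
    for row in rows:
        if row["estimated"] is None:
            excess = None  # leave blank rather than treating it as zero
        else:
            excess = row["estimated"] - row["official"]
        result.append({**row, "excess": excess})
    return result


def plottable(rows):
    """Keep only the years where an excess figure exists."""
    return [row for row in rows if row["excess"] is not None]


with_excess = add_excess(YEARS)
for row in plottable(with_excess):
    print(row["year"], row["excess"])
```

Dropping the empty years means the chart has gaps on the time axis, but the years that do appear are no longer distorted by blanks being read as zero.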

This is a work in progress, so any feedback or advice – much appreciated!

NEXT STOP:

  • Try Tableau
  • create a Glastonbury chart with “events boxes” that explain the data – ie NEW FENCE, bad weather, Jay-Z headline controversy etc.
  • create a word tree
  • experiment with live data

 

All content (c) Caroline Beavon 2020