Where do followers of Australian sport on Twitter come from?
For the past two, almost three, months, I’ve been focused on Twitter to the exclusion of almost all other social networks. This is because of the size of the Twitter audience and the complexity of determining where those Twitter followers are located. I’d really like to move on, but I’ve ended up in a perfectionist loop and can’t seem to find the command for breaking out of it.
My overriding goal, with regard to Australian sport on Twitter, is to determine the size and location of the Australian sport community on Twitter, as expressed by following Australian athletes, clubs, leagues and organizations. The methodology is relatively simple: 1) Develop a list of the major Australian sport-related Twitter accounts. 2) Get a list of all the followers of those accounts. The list should include as much profile information as possible, including the user-inputted location and time zone. 3) Translate user-inputted locations into actual locations. When no user-inputted location exists, attempt to use the time zone field to determine the location.
The first step is manual. I have to look at other people’s lists, check follow lists and build my own list from them. My current list has around 450 accounts. It is constantly in flux, as people and organizations occasionally delete their accounts or get new ones with new names. (This leads to massive drop-offs that I can’t always explain or notice in the moment. Sources documenting the disappearance of accounts don’t always exist, and when I ask a month or two after the fact, people can’t always recall what happened.)
The second step is automated. I’ve had a friend (@Hawkeye7) develop a tool that pulls that information for me. It is relatively simple once executed, but its success largely depends on the input from step one. From October 14 to October 18, we got the follower list for every single account on that list of 450. It took four days because Twitter imposes API limits. We’d hit a limit, wait an hour and then get the next page of followers or the next account’s followers. In the case of @warne888, this took at least 8 hours.
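The wait-and-resume pattern described above can be sketched roughly as follows. This is not @Hawkeye7’s actual tool, just a minimal illustration in Python. The `fetch_page` callable, its `(account, cursor)` signature and the `RateLimitError` exception are all hypothetical stand-ins for whatever client talks to Twitter.

```python
import time
from typing import Callable, List, Optional, Tuple


class RateLimitError(Exception):
    """Hypothetical signal that the API quota is used up for now."""


# Assumed shape of a page fetcher: takes (account, cursor) and returns
# (follower_ids, next_cursor), with next_cursor None on the last page.
PageFetcher = Callable[[str, Optional[int]], Tuple[List[int], Optional[int]]]


def collect_followers(
    account: str,
    fetch_page: PageFetcher,
    wait_seconds: float = 3600.0,  # "wait an hour", per the text above
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Collect every follower ID for one account, pausing on rate limits."""
    followers: List[int] = []
    cursor: Optional[int] = None
    while True:
        try:
            ids, next_cursor = fetch_page(account, cursor)
        except RateLimitError:
            # Quota exhausted: wait, then retry the same page.
            sleep(wait_seconds)
            continue
        followers.extend(ids)
        if next_cursor is None:
            return followers
        cursor = next_cursor
```

Because a rate-limited page is retried rather than skipped, a very large account simply takes longer, which is consistent with one account needing many hours.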
The third step relies on both manual input and automation. I’ve spent probably in the neighborhood of 60 hours creating a list of about 85,000 variants of user-inputted locations. The tasks involved included reverse geocoding, creating lists of city/state/country patterns for Australia based on patterns I saw repeating, and individually evaluating user-inputted locations to make an accurate guess as to what the user meant. My focus was on getting as much city/state/country information as possible for Australia, New Zealand, Canada, the United States and Ireland. For other countries, I just tried to identify which country the account came from.
The countries chosen for more specific location data were based on where I saw Australian sport followers coming from and on my own understanding of a country’s geography. I don’t understand the geography of South Africa and Costa Rica as well as I could, and I didn’t want to spend the time to understand it completely just to label those accounts better. The time it would take to overcome my learning curve felt better spent on continuing to move locations from unknown to relatively known.
Despite the huge amount of time spent on this, I have a list of over 50,000 user-inputted locations that I haven’t begun to look at. As a perfectionist, this drives me absolutely nutters. I can’t possibly translate every user-inputted location into a usable location. (I’m still working on improving it anyway.) The more accurate this list, the better my results. (And given the size of the data sets involved, I make my own input errors that I don’t always find until I run the locations.)
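As a rough illustration of how a variant list and the time zone fallback fit together, here is a minimal Python sketch. It resolves only to a country, which is simpler than the city/state/country detail described above. The function names, the normalisation rule and the sample table entries are all assumptions made for the example, not the actual lists.

```python
from typing import Dict, Optional


def normalise(text: str) -> str:
    """Lower-case, turn commas into spaces and collapse runs of whitespace."""
    return " ".join(text.lower().replace(",", " ").split())


def resolve_country(
    location: Optional[str],
    time_zone: Optional[str],
    variants: Dict[str, str],
    time_zones: Dict[str, str],
) -> Optional[str]:
    """Try the user-inputted location first, then fall back to the time zone."""
    if location:
        hit = variants.get(normalise(location))
        if hit:
            return hit
    if time_zone:
        return time_zones.get(normalise(time_zone))
    return None  # neither field was usable


# Illustrative entries only; the real variant list is far larger.
variants = {"melbourne australia": "Australia", "sydney": "Australia"}
time_zones = {"wellington": "New Zealand"}

print(resolve_country("Melbourne,  Australia", None, variants, time_zones))
print(resolve_country("over the rainbow", "Wellington", variants, time_zones))
```

A lookup table like this only ever matches variants it has already seen, which is why every unreviewed location string stays unresolved until someone adds it by hand.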
To put this in perspective, let’s look at @warne888. He had over 198,000 followers. I found over 50,000 user-inputted locations for which I had no location attached to that variant, and that was when matching against a list of 65,000 variants. India turned out to be the big problem, in that I didn’t have many variants for cities in the country. Out of a list of 50,000 followers for @warne888, 17,000 listed neither a user-inputted location nor a usable time zone for determining their location. That is roughly 34% of followers whose location I’m never going to determine.
That said, with the limitations of this data set understood, let me get to the actual results. These were found by combining the follower results from all 450+ accounts. By combining, I mean that only the UserID (a unique number Twitter assigns to every account) and the country were put in a new file. Duplicate rows were then removed to ensure that people who followed multiple accounts were counted only once. This should begin to give an idea of the comparative global distribution of Australian sport fans.
|United Arab Emirates||1325|
|Vatican City State||21|
|Papua New Guinea||16|
|Trinidad and Tobago||6|
|Bosnia and Herzegovina||4|
|Isle of Man||4|
|Turks and Caicos Islands||1|
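
The combine-and-deduplicate step behind a table like this can be sketched in a few lines of Python. This is an illustration of the idea rather than the process actually used. It assumes each row is a `(user_id, country)` pair and that a given UserID always resolves to the same country; the function name and sample rows are made up.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_unique_followers_by_country(rows: Iterable[Tuple[int, str]]) -> Counter:
    """Count each (user_id, country) pair once, however many accounts it came from."""
    unique_rows = set(rows)  # removes duplicate rows
    return Counter(country for _user_id, country in unique_rows)


# Made-up rows: user 1 follows two accounts, so appears twice before deduplication.
rows = [
    (1, "Australia"),
    (1, "Australia"),
    (2, "Australia"),
    (3, "India"),
]

for country, total in count_unique_followers_by_country(rows).most_common():
    print(f"|{country}||{total}|")
```

Without the deduplication, anyone following several of the 450+ accounts would be counted once per account and the country totals would be inflated.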