4 November 2015 at 23:06 #2303
Cheng, Zhiyuan, James Caverlee, and Kyumin Lee. “You Are Where You Tweet: a Content-based Approach to Geo-locating Twitter Users.” Proceedings of the 19th ACM International Conference on Information and Knowledge Management. New York, NY, USA: ACM, 2010. 759–768. Web. 16 Jan. 2012. CIKM ’10.
Twitter has built a massive user base of over 75 million users since its creation as a microblogging service in 2006. Because tweets carry few geospatial cues, it is difficult to identify a user's location. The only opportunity to state a location is on the user's profile, and studies show that only 26% of users in a random sample list a city there.
The framework set out in this paper aims to identify a user's location based on the content of their tweets. The following are three key features of the proposed approach:
• rely purely on tweet content, no need for IP information or logon details
• classification component with strong local geo-scope words
• lattice-based neighbourhood smoothing level to refine location estimate
The paper proceeds in a number of key steps, starting with a survey of previous work on estimating the location of web content. This survey is well referenced and is drawn on throughout the paper. The preliminary work follows, which involves identifying the dataset and explaining the algorithms to be used.
The content-based location estimation stage identifies key local words and uses them to produce initial results, which are then refined with lattice-based smoothing. An experimental study shows that the refinements improve on the initial results, although even then only 51% of users are placed within 100 miles of their actual location. For users with fewer tweets the error is higher.
The paper concludes that further study could improve the results over time.
Twitter users have the option to put their location on their user profile as a geospatial feature. Few users take up this option (a random sample of one million users showed just 26% had a city name listed as their location), and this makes it extremely difficult to determine where users are. This paper, published in 2010, aims to overcome this by predicting a user's location based on the content of the user's tweets. If the framework is successful it can enable content personalization (relevant advertising, related news stories, etc.). It can also avoid the need for private user information and sensitive data.
The authors acknowledge the following difficulties in the challenge:
• Tweets can contain details of people's interests without reference to location
• Shorthand or non-standard vocabulary
• User’s interests may span multiple locations
• User may have more than one associated location
Before beginning the preliminaries of the study, the authors discuss related work, with well-referenced articles on similar studies, and mention that related methods could be of benefit in assigning locations.
The preliminary stage consists of the strategies used to collect a dataset and the filters applied to keep only user accounts with a properly stated location. The test data is then a set of 5190 users across the US with more than 1000 tweets each. Three measures are defined to evaluate the estimator: Error Distance, Average Error Distance and Accuracy.
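The three measures can be illustrated with a minimal sketch. It assumes that Error Distance is the great-circle distance in miles between a user's actual and estimated location, and that Accuracy is the share of users whose Error Distance falls within a threshold (100 miles here). The function names and the haversine formula are my own choices for illustration, not taken from the paper.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def error_distance(actual, estimated):
    """Great-circle (haversine) distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, actual)
    lat2, lon2 = map(math.radians, estimated)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def average_error_distance(pairs):
    """Mean Error Distance over a list of (actual, estimated) location pairs."""
    return sum(error_distance(a, e) for a, e in pairs) / len(pairs)


def accuracy(pairs, threshold_miles=100.0):
    """Fraction of users whose estimate lies within threshold_miles of the actual location."""
    hits = sum(error_distance(a, e) <= threshold_miles for a, e in pairs)
    return hits / len(pairs)
```

With this reading, the headline result of the paper corresponds to `accuracy(pairs)` being about 0.51 on the test users.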
Once these were in place, the next stage was a Baseline Location Estimation, which estimated the probability of each city given each word in a user's tweets using maximum likelihood estimation. The initial results were disappointing, and further work was required on identifying local words and overcoming tweet sparsity. Local words were then selected as those with a high local focus: heavy use around a central point and a rapid drop-off moving away from it. A lattice-based approach is then taken, dividing the map of the United States into a grid.
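A minimal sketch of how such a baseline could work is given below. It assumes that the probability of a city given a word is simply the share of that word's occurrences seen in the city, and that a user's score for a city is the sum of those probabilities weighted by how often the user uses each word. The function names and the toy data layout are hypothetical, and the paper's exact formulation may differ.

```python
from collections import Counter, defaultdict


def train_word_city_probs(training):
    """training: iterable of (city, list_of_words) pairs.

    Returns probs[word][city], the maximum likelihood estimate of
    p(city | word): occurrences of the word in that city divided by
    its occurrences overall.
    """
    counts = defaultdict(Counter)
    for city, words in training:
        for word in words:
            counts[word][city] += 1
    probs = {}
    for word, per_city in counts.items():
        total = sum(per_city.values())
        probs[word] = {city: n / total for city, n in per_city.items()}
    return probs


def estimate_city(user_words, probs):
    """Return the city with the highest summed p(city | word) * p(word | user).

    Returns None when none of the user's words were seen in training.
    """
    word_freq = Counter(user_words)
    n_words = len(user_words)
    scores = defaultdict(float)
    for word, k in word_freq.items():
        for city, p in probs.get(word, {}).items():
            scores[city] += p * (k / n_words)
    return max(scores, key=scores.get) if scores else None
```

A word seen only once, in a single city, gets probability 1 for that city and 0 everywhere else, which shows how sparse data can distort an unsmoothed estimate.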
To overcome tweet sparsity, Laplace smoothing was applied first. Although simple to implement, it does not take the full geographic distribution into account, as it treats nearby areas with no occurrences of a local word the same as cities far away. To overcome this issue, lattice-based neighbourhood smoothing was applied, and the final approach was a model-based smoothing that builds on the previous two stages. All the steps in the approach were supported with detailed diagrams and an explanation of each algorithm.
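The difference between the first two smoothing stages can be sketched as follows. This is a simplified illustration under my own assumptions, not the paper's algorithm: Laplace smoothing is shown as add-one smoothing over a fixed list of cities, and neighbourhood smoothing is shown on a grid of (row, col) cells where each cell's probability is mixed with the mean of its eight surrounding cells. The mixing weight `mu` and all function names are hypothetical.

```python
def laplace_smooth(counts, cities):
    """Add-one smoothing of a word's per-city counts over all known cities."""
    total = sum(counts.get(city, 0) for city in cities)
    return {city: (counts.get(city, 0) + 1) / (total + len(cities))
            for city in cities}


def neighbourhood_smooth(cell_probs, all_cells, mu=0.9):
    """Mix each grid cell's probability with the mean of its eight neighbours.

    cell_probs: {(row, col): probability of the word in that cell}
    all_cells:  cells to produce a smoothed value for
    mu:         weight kept on the cell's own probability (an assumed value)
    """
    smoothed = {}
    for row, col in all_cells:
        own = cell_probs.get((row, col), 0.0)
        neighbours = [cell_probs.get((row + dr, col + dc), 0.0)
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                      if (dr, dc) != (0, 0)]
        smoothed[(row, col)] = mu * own + (1 - mu) * sum(neighbours) / len(neighbours)
    return smoothed
```

Under add-one smoothing, a city next to Houston and a city on the other side of the country receive the same small probability for “rockets” if neither has any occurrences. Under the grid version, only the cell adjacent to the Houston cell receives any weight, which is the geographic behaviour that the neighbourhood smoothing is meant to add.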
The final stage was an experimental study on the test data. The first test showed a massive improvement on the initial results from the Baseline Location Estimation. The encouraging results show that 51% of the test users can be placed within 100 miles of their location. The hypotheses offered to explain the errors are that some users are difficult to locate based on their tweets, and that some users put the wrong location on their account. A test was also done on users with fewer tweets, and it shows that the fewer tweets a user has, the larger the error in their estimated location.
In conclusion, based on their study of the proposed framework, the authors show that the location estimator can place 51% of users within 100 miles of their actual location. They anticipate continued refinement of this approach with additional data and are also interested in further exploration of location estimation in the hope of tracking user location more accurately.
In my opinion, the framework is flawed and, even with further exploration, is unlikely to be successful. One of the key elements of the framework is the set of local words used in tweets. These local words are based on maximum likelihood estimation and have no firm science behind them. An example used is the word “rockets”, which has a large peak near Houston due to the name of the local basketball team and the city being the home of NASA. Many Twitter users across the US could have a keen interest in NASA and would regularly tweet with the word “rockets”. This does not necessarily mean they are living in Houston. As pointed out within the article, a city near Houston with zero occurrences of the word “rockets” is treated in the same way as a city far away from Houston.
The dataset used is also a specific set of users with over 1000 tweets each. The study itself shows that when the authors looked at users with fewer tweets, the average error distance was higher. The more a user tweets, the more information is leaked and the easier the user is to locate. If 51% is the success rate on users with a large number of tweets, then it is unlikely the approach would prove successful on a larger scale.
Although this particular framework has a low success rate, further exploration to develop more robust estimators of a user's location may prove more successful.