Archiving Tweets: Reckoning with Twitter’s Policy

The challenge of preserving social media is an important topic in the contemporary data landscape. In the case of Twitter, millions of tweets are issued every day, and the conversations that happen on Twitter form an essential record of our time; but like all records, this conversation can disappear if not adequately preserved. Vint Cerf from Google spoke to the media recently about the danger of a “digital dark age”, as current storage methods become obsolete. To most people, especially those working in digital preservation, this came as no surprise.

Finding sustainable, efficient ways to gather, preserve and provide access to social media archival data is the driving force behind the Social Repository of Ireland, a joint project of the Digital Humanities and Journalism group at the Insight Centre for Data Analytics and the Digital Repository of Ireland at the Royal Irish Academy. Over the past year or so, the Social Repository of Ireland has investigated the feasibility of developing an effective social media archiving tool for Twitter data relating to significant events in Ireland. During our research, we have identified some important issues that anyone thinking of setting up a Twitter archive needs to be aware of. In this article, we look at those issues, examining the historical relationship between developers and Twitter, changes to the Developer Rules over time, and how other projects have fared when attempting to gather and preserve tweet data in a social media archive.


Developers and Twitter

Twitter makes its API available to developers to allow them to build tools that work with Twitter data. Before 2011, most projects of this kind (for example, Gnip, Topsy and Datasift) operated independently, but since then many of them have become official Twitter Partners. This is a result of several changes in Twitter’s Developer Policy over the years, changes which have alternately delighted and devastated developers. The most (in)famous of these changes came in 2011, when Twitter made it much more difficult for third-party developers to gather and syndicate tweet data in any meaningful way. To many, these changes were surprising, considering the relative openness and freedom Twitter had allowed developers prior to 2011.


2006 – 2010: the open years

Between 2006 (the year of Twitter’s birth) and 2010 a number of tools and projects, both proprietary and openly accessible, used Twitter’s API to develop scraper and aggregation tools. At that time, Twitter’s developer policy did not explicitly prohibit this kind of use. Projects such as Storify, Topsy and TwapperKeeper were launched. During this period, Twitter had a stronger focus on open data and making public tweets reasonably available. This approach was centred on the idea of Twitter content as an archive of our time: a ‘legacy approach’. It reached its zenith in 2010 when Twitter signed an agreement with the Library of Congress to archive the entire Twitterstream from 2006 onwards and for all tweets going forward. This appeared to reflect a commitment to the principles of open data and archival transparency.


Changes to the Developer Rules, 2011

The ‘legacy approach’ described above appeared to change somewhat in 2011 when Twitter made changes to its Developer Rules. There appeared to be somewhat less focus on making tweet data openly accessible to applications not owned by Twitter. This change may have been brought about by the worldwide recession, which was at its height at that time. While it’s not possible to say for sure what Twitter’s motivations were, it may be that the company hoped to gain new revenue streams by partnering with and monetising the various tweet scrapers and aggregators (such as Topsy, TwapperKeeper and Datasift) that third-party companies and programmers had developed. Many larger tools and projects became official Twitter partners (e.g. Gnip, Hootsuite).

The text of the 2011 Developer Rules is no longer available, but the essence of the changes was that third-party apps and tools were no longer permitted to ‘replicate the core Twitter experience’. This was described in more detail by Ryan Sarver, at the time Director of Platform at Twitter:

“Developers have told us that they’d like more guidance from us about the best opportunities to build on Twitter.  More specifically, developers ask us if they should build client apps that mimic or reproduce the mainstream Twitter consumer client experience.  The answer is no,”

Sarver also made explicit Twitter’s desire to create a ‘less fragmented’ experience for Twitter users by reducing the number of ‘consumer client apps that are not owned or operated by Twitter’.

Third-party developers were not explicitly barred from gathering or syndicating Twitter data, but they were expected to keep within a certain size (‘size’ in this context referring to the number of user tokens needed by an app on a daily basis). The number of user tokens allowed per day ranged from 50,000 to 100,000, and the new Developer Rules stated that apps wishing to extend their user tokens needed to contact Twitter to gain permission. Even then, it was not specified what exactly an app needed to do to gain that permission. The rules seemed vague, perhaps to ensure that Twitter would retain control over as many apps and tools as possible.

Realistically, Twitter were not able to shut down widely used apps such as Tweetdeck, even though they were technically violating the new Developer Rules. Instead, Twitter appeared to adopt a policy of partnership. Tweetdeck, Hootsuite, Datasift and Gnip were among the products that became Certified Product Partners.

It is possible that part of the motivation behind the new rules was the need, to some extent, for Twitter to monetise users’ tweets. Around the time of the Developer Rules change, Twitter suspended products developed by the company Ubermedia that it believed were violating its trademarks and the privacy of users. Crucially, in its takedown notice, Twitter stated that the products were ‘changing the content of users’ Tweets in order to make money.’ Combined with the new restrictions that had been placed on third-party tools and apps, and the Certified Partnership Program for apps that had already passed a certain size, this focus on monetisation indicated that Twitter wished to keep financial profit from user tweets within the company itself.

Many developers were unhappy with the new rules. Some speculated that their severity would drive innovators away from using Twitter, but the service remains as popular as ever, so any project that wishes to analyse data relating to news events is still required to rely on Twitter’s API.

The Developer Rules have been relaxed slightly since 2011, but are still somewhat restrictive for third-party apps. Many of the projects that shut down in 2011 did not restart. In some cases, this may have had as much to do with the separate end of funding streams as with the Twitter shutdown. Others were absorbed into larger products, e.g. TwapperKeeper into HootSuite (which eventually partnered with Twitter).

Since then, data scrapers that operate commercially and behind a paywall, even ones that are not official Twitter Partners, are generally not interfered with by Twitter. Twitter’s 2014 purchase of Gnip, its largest proprietary data reseller, appears to represent a decision to take complete financial control of its data reselling by buying the company that was already a leader in the field.

Data scraping and collection carried out by non-commercial or research tools and projects is still potentially vulnerable, as shown by the case of ScraperWiki, a tool that allows users to build basic data scraping tools without requiring programming knowledge. Despite the fact that Twitter does allow data gathering for non-commercial purposes, the Developer Rules are remarkably vague as to what constitutes ‘syndication’ (see below for the relevant text from the Rules). Twitter’s own Twitter Archive service (developed in 2012) appears to hold the non-commercial monopoly on making ‘human-readable’ datasets of users’ own accounts and searches available. However, services such as the Internet Archive continue to make datasets available in raw unstructured format, without attracting Twitter’s ire. This is probably because the average user will have little use for unstructured data in a non-human-readable format such as JSON.


The current situation

According to the current (2015) Twitter Developer Policy, tools and projects may gather Twitter data but there are restrictions on what may be done with it. As has always been the case, many of these restrictions are in place to protect users’ privacy, to prevent compromise of Twitter’s product and/or to prevent the distribution of spam. However, section I, part 6 of the Policy places restrictions on the number of tweets a tool may gather and on how that information is distributed. The section states:

“If you provide Content to third parties, including downloadable datasets of Content or an API that returns Content, you will only distribute or allow download of Tweet IDs and/or User IDs. You may, however, provide export via non-automated means (e.g., download of spreadsheets or PDF files, or use of a “save as” button) of up to 50,000 public Tweets and/or User Objects per user of your Service, per day.” (Twitter Developer Policy Section I, part 6)

This statement is somewhat contradictory. The first part prohibits users of tweet gathering tools from openly accessing complete tweet data, while the second part indicates that this data may be accessed, but only by ‘non-automated’ means (manually downloading spreadsheets or PDFs of data) and comprising not more than 50,000 public tweets per user, per day. It is essentially a moderate climb-down from the 2011 rules: an acknowledgement by Twitter that social media data cannot be completely firewalled, and that while it is worth Twitter’s while to attempt to streamline the user experience as much as possible, some amount of third-party development on the API is still going to happen.
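To make the two halves of the quoted clause concrete, here is a minimal Python sketch of how an archiving tool might apply them. It is an illustration only, not Twitter’s or any project’s actual code: the record fields (`id`, `user_id`), the function `ids_only` and the class `ExportQuota` are all hypothetical names, and the only figure taken from the Policy is the limit of 50,000 per user, per day.

```python
from collections import defaultdict
from datetime import date

# Figure quoted from Section I, part 6 of the 2015 Developer Policy.
DAILY_EXPORT_LIMIT = 50_000


def ids_only(tweets):
    """Reduce full tweet records to the ID pairs a shared dataset may contain.

    The input field names ("id", "user_id") are assumptions for this sketch.
    """
    return [{"tweet_id": t["id"], "user_id": t["user_id"]} for t in tweets]


class ExportQuota:
    """Track manual (non-automated) exports per user of the service, per day."""

    def __init__(self, limit=DAILY_EXPORT_LIMIT):
        self.limit = limit
        self._used = defaultdict(int)  # (user, day) -> tweets exported so far

    def remaining(self, user, day):
        return self.limit - self._used[(user, day)]

    def request(self, user, day, count):
        """Grant up to `count` tweets without exceeding the daily limit."""
        granted = max(0, min(count, self.remaining(user, day)))
        self._used[(user, day)] += granted
        return granted


quota = ExportQuota()
today = date(2015, 5, 22)
first = quota.request("alice", today, 30_000)   # fully granted
second = quota.request("alice", today, 30_000)  # trimmed to what is left
```

Under these assumptions, a second request on the same day is cut back to the remaining allowance, while a downloadable dataset offered to third parties would pass through `ids_only` first.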


Who made it, who didn’t

The following are some examples of tools and projects that have been banned from scraping Twitter data since 2011, and of some that survived the ‘cull’ (this list is not exhaustive). While specific reasons for shutdown or survival appear to vary, there are some common threads in many of these cases.

TwapperKeeper
http://twapperkeeper.com/index.html

A JISC-funded tool for searching, collating and exporting user tweets. It was designed for individual users and researchers and operated as a Twitter archiving service from 2010 to 2011. In 2011 Twitter charged it with violation of their Developer Rules under the syndication of content clause. Twitter classified exporting tweets in a usable format as syndication, and the service was shut down. The product was absorbed into HootSuite, and the open source version still provides access to unstructured raw tweet data, similar to the Internet Archive.

Web Ecology Project 140kit
http://www.webecologyproject.org/2010/07/presenting-140kit/

This was a project funded by Harvard University and started in 2007 with the aim of aggregating and annotating the Twitterstream for researchers to use. While it ceased operations in 2011, this may have had as much to do with its funding coming to an end as its violation of the 2011 Developer Rules. It also may have challenged Twitter’s anti-spam rules when the Web Ecology Project held a competition in 2009 inviting developers to create a Twitter spambot.

ScraperWiki
https://scraperwiki.com/

This is a more recent (2014) example of a tool being shut down by Twitter for Developer Rules violations. ScraperWiki is a data crawling and harvesting tool that allows users to export and manage social media data in easy-to-use graphics. It continues to provide its service to users collecting data from sources other than Twitter, but as of 2014 it cannot scrape and aggregate Twitter data due to violation of the syndication of content clause. ScraperWiki themselves speculated on some possible reasons for this on their blog. As well as reflecting Twitter’s increasing focus on the market, the shutdown may have been triggered by concerns about privacy, because a harvesting tool may not be able to match real-time tweet deletion. ScraperWiki also speculated that Twitter were keenly aware that there is a gap between business use of the Twitter firehose and the data-gathering needs of ordinary users. Twitter may have targeted ScraperWiki because it saw them as filling the ‘ordinary user’ remit.

When you compare the tools and services that survived 2011, the common factor seems to be the lack of free, easy access to a human-readable presentation of the collected data. For example, ARCOMEM, an FP7-funded European Commission project geared towards using digital and web archives to enhance community memory, was not challenged, perhaps because a professional or university login was required to view the collected data and metadata. The Tweepository, a project developed by the University of Southampton as part of their ePrints digital repository, did not fall foul of the 2011 rules either, perhaps also because of the ‘wall’ of a university login between the user and the data. While neither of these services charged for access, the ‘distancing’ effect of an institutional login appeared to allow them to stay on just the right side of Twitter’s vague syndication rules.


Conclusion

Twitter data is the social archive of the early 21st century. No archive of social media can afford to neglect looking for solutions to the problems of collecting, preserving and making this data available. When it comes to scalability, projects such as the Social Repository of Ireland, with a remit limited to one country, have a better chance of developing tools to manage data than huge-scale projects such as the Library of Congress Twitter archive. The Library of Congress also has yet to devise a workable solution for access to the Twitter archive.

However, scalability and access mean nothing if the data cannot be archived in the first place. Because Twitter is a private company, archivists and programmers are subject to its developer policy. From an approach prioritising open data and shared access, the company appears to have shifted towards a more market-centred attitude in recent years. But even Twitter could see that excessively restrictive amendments to the Developer Rules were unsustainable in the long term. Occasionally, though, it still likes to exercise a little muscle, as in the case of ScraperWiki. The recentness of the ScraperWiki incident serves as a reminder to tweet collector projects to remain mindful of possible restrictions on their actions.

We needn’t throw our hands up in despair. It’s a shame that the freer, open-data approach is no longer the dominant mood at Twitter, but all is not bleak for developers wishing to find ways to gather and manage Twitter data. Certain restrictions (for example, on amounts of data gathered, and on ease of access) may have to be put in place, but ultimately there is still scope for ‘bona fide’ researchers to gain access to the incredibly rich resource that is the ever-changing Twitterstream.

#SMArchiving: International Workshop on Social Media Archiving at Hypertext 2015

Insight News Lab is co-organising a workshop on Social Media Archiving at the 26th ACM Hypertext and Social Media Conference, taking place 1-4 September 2015 at the Middle East Technical University, Northern Cyprus Campus.

The workshop focuses on the archiving and annotation of content produced on social media, and builds on our work on the Social Repository of Ireland Project, conducted with the Digital Repository of Ireland. This is a multidisciplinary workshop, which will stimulate an exchange of ideas between research and industry, including the domains of news media, digital archiving and preservation, social network analysis, semantic web and linked data, communication studies and cultural studies.

The exponential growth of social media as a central communication practice has changed the information landscape. Social media is agile when it comes to breaking news, and expansive in its ability to document social events from a wider variety of voices than traditional media. From a news media perspective, social media has been adopted as a significant source by professional journalists, and conversely, citizens are able to use social media as a form of direct reportage. Broadly, social media content now forms a significant part of the digital content generated every day, and provides a platform for voices that would not reach the broader public through traditional journalistic media alone. In this emerging environment, citizen microblogs and other user generated content constitute an important part of history and popular memory, in particular when attempting to capture significant events and the varied perspectives that accompany these events.

The flow of citizen generated reporting through social media is ephemeral and disordered; it quickly becomes inaccessible if not captured and stored in an organised fashion. This is in contrast to traditional journalism, which has well-developed archival practices, enabling researchers to reuse and rediscover content. If social media is not preserved, or if it is preserved without careful attention to subsequent access and discoverability, there is a risk of losing the diversity this rich social narrative contributes to traditional news media, and more broadly, to our socio-cultural record. Therefore, this new landscape calls for technologies and methodologies to:

1) Rapidly and efficiently capture, filter and verify content in a way that generates immediate value for journalistic and research purposes;

2) Properly annotate and archive this information for longer-term preservation;

3) Provide access to archived social media for scholarly research, and for reuse in the news life-cycle (e.g. for contextualisation, investigative reporting or comprehensive storytelling.)

This workshop addresses a variety of research questions from both theoretical and pragmatic perspectives. For example, what technologies can we use for filtering, aggregating and contextualising social media content? How can we assess the veracity of social media content and sources? What moral, legal, and ethical issues arise in social media archiving? What are the methodological considerations in organising and interpreting the social media ‘record’ of an event? What data models should be employed for archiving and preservation, and how should metadata be structured? What does this record contribute to our larger understanding of communication and media? How can rigorous archiving and preservation of social media help researchers and journalists in their work on social movements, citizen engagement, political events, and network formation? 

Topics of interest include, but are not limited to:

  • Social media archiving and annotation
  • Metadata for social media archiving
  • The role of user generated content in the news production life-cycle
  • Citizen Journalism and the Archive
  • Veracity, trust and provenance of social media sources and content
  • Event and topic detection and clustering
  • Social network and community analysis
  • Semantic Web and Linked Data technologies for archival, discovery and enrichment of social media content
  • Story curation, contextualisation and recommendation
  • Ethical challenges in archiving and broadcasting social media content

Important Dates:

Paper submission: 12 June 2015
Notification to authors: 10 July 2015
Camera-ready: 17 July 2015
Workshop date: 1 September 2015

A Call for Papers has been issued, and submissions are due 12 June 2015. All papers are peer-reviewed. For more information, see the Workshop website: SocialMediaArchiving.net

For more information on the conference as a whole, please visit the Hypertext 2015 conference website.

Twitter says YES to #MarRef: How the same-sex marriage referendum played out on Twitter

On Friday May 22nd the country voted in the first ever same-sex marriage referendum. Voters were asked whether the Constitution should be changed so as to extend civil marriage rights to same-sex couples.

In the build-up to the same-sex marriage referendum, Twitter was one of the platforms used for exchanging ideas. We looked into Twitter conversations and tracked the hashtags which were being used during the campaigning. We collected over half a million tweets from 28th April 2015 to 23rd May 2015. For this article we have focused on one week of the referendum campaign, covering just under 200,000 tweets collected from Monday 18th May to Saturday 23rd May, 11:00 a.m. We collected tweets for the following hashtags: #marref, (#voteyes and #marref), (#voteno and #marref) and #yesequality. In the case of #voteyes and #voteno, we deliberately paired them with #marref to make sure we were not collecting noise from any other campaigns with yes and no votes that may have been happening around the globe.
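The pairing rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the collection code actually used: the function names and the query labels are hypothetical, and hashtags are matched case-insensitively with a simple regular expression.

```python
import re

HASHTAG = re.compile(r"#(\w+)")


def hashtags(text):
    """Return the lower-cased hashtags in a tweet, without the leading '#'."""
    return {tag.lower() for tag in HASHTAG.findall(text)}


def matched_queries(text):
    """List which of the four collection queries a tweet satisfies.

    #voteyes and #voteno only count when #marref is also present, so that
    yes/no campaigns elsewhere in the world are not picked up.
    """
    tags = hashtags(text)
    matches = []
    if "marref" in tags:
        matches.append("marref")
    if {"voteyes", "marref"} <= tags:
        matches.append("voteyes+marref")
    if {"voteno", "marref"} <= tags:
        matches.append("voteno+marref")
    if "yesequality" in tags:
        matches.append("yesequality")
    return matches


paired = matched_queries("Please #VoteYes on Friday #MarRef")
unpaired = matched_queries("#voteno in some other poll")
```

Here `paired` matches both the #marref query and the paired #voteyes query, while `unpaired` matches nothing because #voteno appears without #marref.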

Of the 197,186 tweets in this period, only 9,171 contain #voteno, while 93,747 contain #voteyes or #yesequality. Among the tweets carrying #voteno, #voteyes or #yesequality in Twitter conversations, this makes 91% Yes and 9% No.
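The percentages follow directly from the two counts. The short Python check below reproduces them, assuming the Yes-tagged and No-tagged counts do not overlap (the article does not say how tweets carrying both kinds of hashtag were treated).

```python
# Counts reported for 18th-23rd May 2015.
yes_tagged = 93_747   # tweets with #voteyes or #yesequality
no_tagged = 9_171     # tweets with #voteno

# Assumption: no tweet is counted in both groups.
stance_total = yes_tagged + no_tagged

yes_share = 100 * yes_tagged / stance_total
no_share = 100 * no_tagged / stance_total

print(f"{stance_total:,} stance-tagged tweets: "
      f"{yes_share:.1f}% Yes, {no_share:.1f}% No")
# prints: 102,918 stance-tagged tweets: 91.1% Yes, 8.9% No
```

Note that the shares are taken over the 102,918 stance-tagged tweets, not over all 197,186 collected tweets.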

[Figure: MarRef Yes/No]

A total of 54,051 users were involved in the conversation, exchanging 165,162 tweets. These users form 5 main clusters, plus a number of users categorised as ‘others’, who are not specifically engaged with any of the clusters.

Of these 5 clusters, 2 have a very clear and strong centre point. @YesEquality2015 is the centre of cluster one, and also the central user in the whole dataset. It has 4,202 tweet exchanges in total, with 52 outgoing and 4,150 incoming, including retweets and mentions. Cluster 2 does not have as clear a centre as cluster 1, but @ireland is the largest node in this cluster. It has 1,084 interactions in total, with 88 outgoing and 996 incoming, including retweets and mentions. The largest node in Cluster 3 is @Colmogorman, with 2,184 interactions in total: 219 outgoing and 1,965 incoming.

The other two clusters do not have a very distinct centre user and the conversations are more spread between various users.
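The incoming, outgoing and total figures quoted for each account are the kind of counts that can be derived from a list of directed interactions. The following Python sketch shows one way to do that. It is an assumption about the method, since the article does not name the tool used: an edge `(source, target)` is taken to mean that `source` retweeted or mentioned `target`, and `interaction_counts` is a hypothetical name.

```python
from collections import Counter


def interaction_counts(edges):
    """Count outgoing, incoming and total interactions per user.

    `edges` is a list of (source, target) pairs, one per retweet or mention.
    """
    outgoing = Counter(src for src, _ in edges)
    incoming = Counter(dst for _, dst in edges)
    users = set(outgoing) | set(incoming)
    return {
        user: {
            "outgoing": outgoing[user],
            "incoming": incoming[user],
            "total": outgoing[user] + incoming[user],
        }
        for user in users
    }


# Toy data, not taken from the real dataset.
toy_edges = [
    ("anna", "YesEquality2015"),
    ("brian", "YesEquality2015"),
    ("YesEquality2015", "anna"),
    ("brian", "anna"),
]
counts = interaction_counts(toy_edges)
hub = max(counts, key=lambda user: counts[user]["total"])
```

The account with the highest total is a natural candidate for the largest node in its cluster; detecting the clusters themselves would need a separate community detection step that is not shown here.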

[Figure: MarRef network]


We also looked into the most frequent words used in the tweets and how they evolved over time. Not surprisingly, “#marref” was the most frequent word used in the tweets. This was followed by “vote”, “#yesequality” and “#voteyes” throughout the week. On the 19th of May there was a surge of tweets with the hashtag #rtept, which mainly discussed the RTE Prime Time Debate on the same-sex marriage referendum. Towards the end of the week Twitter users started using other words more and #marref less. “Tomorrow” was a word that kept appearing in tweets on Thursday 21st, and “today” was a leading word on Friday 22nd. Another important hashtag that started showing up on Thursday was “#hometovote”, a hashtag used by Irish people living outside Ireland who were travelling home to vote. Their tweets and pictures became a sensation on Thursday night. “Voted” was a word which was widely used on the day of the referendum. “#MarRef” was again the most widely used word in our dataset on Saturday 23rd, while the nation was awaiting the final results.
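A per-day word frequency count of this kind can be sketched as follows. This is a simplified Python illustration, not the analysis pipeline behind the figures above: the tokeniser is a plain regular expression that keeps a leading ‘#’, no stop words are removed, and `top_terms_by_day` is a hypothetical name.

```python
import re
from collections import Counter, defaultdict

# A word, optionally preceded by '#', so hashtags stay distinct from words.
TOKEN = re.compile(r"#?\w+")


def top_terms_by_day(tweets, n=3):
    """Return the n most frequent lower-cased terms for each day.

    `tweets` is an iterable of (day, text) pairs.
    """
    counts = defaultdict(Counter)
    for day, text in tweets:
        counts[day].update(token.lower() for token in TOKEN.findall(text))
    return {
        day: [term for term, _ in counter.most_common(n)]
        for day, counter in counts.items()
    }


# Toy tweets, invented for illustration.
sample = [
    ("2015-05-21", "Vote tomorrow #marref"),
    ("2015-05-21", "Home tomorrow #hometovote"),
    ("2015-05-22", "Voted today #marref"),
    ("2015-05-22", "Today is the day"),
]
leading = top_terms_by_day(sample, n=1)
```

With this toy input the leading term is “tomorrow” for the 21st and “today” for the 22nd. A real analysis would normally filter common stop words before ranking.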

[Figure: MarRef timeline]

Research Assistant/Software Engineer vacancy on News360 project in collaboration with RTÉ

The Digital Humanities and Journalism group at the INSIGHT National Centre for Data Analytics @NUI Galway is seeking an outstanding candidate for a Research Assistant or Research Associate position in the Social Semantic Journalism realm. The successful candidate will work on a project in collaboration with the Irish national broadcaster, RTÉ, on ‘social news detection and contextualisation’. The project is jointly funded by RTÉ and the Science Foundation Ireland.

The successful candidate will focus on software development and UX design. She/he will have a passion for social media mining, Linked Data and the news and media industry.

Essential criteria for the applicant:

  • Bachelor’s degree in Computer Science, Informatics, or relevant subjects.
  • Strong Java or Python programming skills.
  • Strong front-end development and UX design skills.
  • Minimum two years software development and UX design experience.
  • Proven ability to work independently or in a team environment.
  • The applicant should be creative and enthusiastic, with excellent communication skills.

Desirable – It is desirable that the applicant will possess:

  • A Masters or MPhil degree in Computer Science, Informatics or relevant subjects.
  • A Masters degree in Journalism, Communication, Political Studies or relevant subjects.
  • Familiarity with the news and media industry, knowledge of the news production process and preferably having worked in/with such organisation.
  • Experience of developing applications for news and media industry.
  • Experience in / familiarity with the following is highly desirable:
    • Semantic Web and Linked Data Technologies
    • Social Network Analysis
    • Stream Processing
    • Digital/Online/Citizen Journalism
    • Natural Language Processing
    • Text Mining
    • Data Visualisation
  • A good record of theoretical and applied research on the Social Web and the Semantic Web, warranted by a scientific publication record in workshops, conferences, journals and book chapters and/or awards in national or international challenges.
  • Excellent interpersonal communication and scientific writing skills.

Salary range:

Research Assistant: €25,425 – €32,173 per annum.
Research Associate: €37,750 – €40,000 per annum.

This post is available immediately and is fixed term until the end of September 2015.

For informal discussion about this post please contact: Bahareh.Heravi@insight-centre.org.

To Apply: Applicants should send a cover letter, curriculum vitae and the names and addresses of at least three referees, via email (text, postscript or PDF only) to: hr.ie@deri.org

Final date for accepting applications is 29 April 2015; however, we will close the position as soon as it is filled.

First National Survey on Irish Journalists’ use of Social Media

After collecting and analysing data from hundreds of professional journalists working in Ireland, HuJo is releasing a comprehensive report on Irish Journalists’ use of Social Media. This is the first survey of its kind undertaken in Ireland and is being launched on the 7th January 2015 at the “Citizen Journalism and Social Media Archiving” minitrack of the 48th Hawaii International Conference on System Sciences (HICSS 48). HICSS is the #1 Information Science conference in terms of citations, according to Google Scholar. Dr. Bahareh Heravi and Dr. Natalie Harrower, two of the authors of the survey report, co-chair the minitrack on “Citizen Journalism and Social Media Archiving”.

The survey was open to all professional journalists working in Ireland, and was distributed widely to attract the broadest possible set of responses. The survey poses a wide variety of questions to journalists, in an effort to reveal how journalists integrate social media into their workflows, how they perceive the information they find through social media, and what steps they take to investigate a social media source’s validity.

Overall, the survey reveals that Irish journalists have integrated social media into their journalistic practices quite heavily: 99% of Irish journalists use social media, with half of those using it daily. It further reveals that the most common use for social media among Irish journalists is sourcing. On a daily basis, 58% of Irish journalists use social media for finding news leads and 49% use it for sourcing content. Despite the wide use of social media by Irish journalists, a considerable number of them have concerns over the veracity of information on social media and believe that, without external verification, the information from social media cannot be trusted. Very few journalists use specialist tools to validate information, instead relying on the practice of contacting individuals directly.

The survey report provides detailed information on the ways Irish journalists use social media and compares their use across various aspects and factors. You can download the full report to find out much more about how Irish journalists use social media.

Download the Irish Social Journalism report

Research Assistant Vacancies in Social Media Mining for Social Semantic Journalism

Applications are invited from suitably qualified candidates for two Research Assistant positions in social media mining in the social semantic journalism realm. The successful candidates will work within the Digital Humanities and Journalism group at the Insight Centre for Data Analytics @NUIG (formerly known as DERI), on a project on ‘a socially aware multimedia enrichment solution for journalists and newswire subscribers’, funded by the Science Foundation Ireland.

The Insight Centre for Data Analytics at NUI Galway hosts one of the most internationally-recognised Linked Data research groups in the world, and is dedicated to research aimed at enabling Networked Knowledge using Semantic Web technologies.

The successful candidate will conduct research in the area of social media mining, natural language processing and social event detection and classification. She/he will also be expected to play a strong development role in the project.

Essential criteria for the applicant:

  • Bachelor’s degree in Computer Science, Informatics, or relevant subjects.
  • Strong Java or Python programming skills.
  • Minimum two years software development experience.
  • Knowledge of software design, development and maintenance processes.
  • Strong background in front- and backend Web application development.
  • Have experience in some, if not all, of the following:
    • Social Media Mining
    • Natural Language Processing
    • Social Network Analysis
    • Semantic Web and Linked Data technologies
    • Stream processing
  • Proven ability to work independently or in a team environment.
  • The applicant should be creative and enthusiastic, with excellent communication skills.

It is desirable that the applicant will possess:

  • A Masters or MPhil degree in Computer Science, Informatics or relevant subjects.
  • A Masters degree in Journalism, Communication, Political Studies or relevant subjects.
  • Experience working in industry and developing commercial software products.
  • Experience in / familiarity with the following is highly desirable:
    • Natural Language Processing
    • Text Mining
    • Data Visualisation
    • Data Verification
  • A good record of theoretical and applied research on the Social Web and the Semantic Web, warranted by a scientific publication record in workshops, conferences, journals and book chapters and/or awards in national or international challenges.
  • Excellent interpersonal communication and scientific writing skills.

Salary range: €23,876 – €32,930 per annum.

This post is available from 1st February 2015 and is full time for a fixed term for 12 months.

For informal discussion about this post please contact: Bahareh.Heravi@insight-centre.org

To Apply: Applicants should send a cover letter, curriculum vitae, a list of publications, a research statement and the names and addresses of at least three referees, via email (text, postscript or PDF only) to: hr.ie@deri.org

Please include Ref: NUIG 112-14 in the subject of your email.


Final date for accepting applications is 29 April 2015; however, we will close the position as soon as it is filled.


About NUI Galway

The National University of Ireland, Galway is home to more than 15,000 students across five Colleges with highly active agendas in teaching and research.

The Insight Centre for Data Analytics is a joint initiative between researchers at NUI Galway, University College Dublin, University College Cork, and Dublin City University, as well as other partner institutions. It brings together a critical mass of more than 200 researchers from Ireland’s leading ICT centres to develop a new generation of data analytics technologies in a number of key application areas.

The €75m centre is funded by Science Foundation Ireland and a wide range of industry partners. Insight’s research focus encompasses a broad range of data analytics technologies and challenges, including machine learning & data mining, media analytics and optimisation and decision analytics, personalisation and recommender systems, the Semantic Web and Linked Data and the sensor web. Together with more than 30 partner companies, Insight researchers are solving critical challenges in the areas of Connected Health and the Discovery Economy.

National University of Ireland, Galway is an equal opportunities employer.

Social Media Hoax fails to derail election bid

Social media has become an important source for news stories; the combination of 24-hour communication, direct access to sources and the ever-present risk of a gaffe or mistake makes it a great asset for the clued-in newsroom. But the growth in sources and channels, not to mention the prevalence of Photoshop and its like, means that hoaxes and misinformation compete for attention with genuine and useful content. Thus, verification is an ever-expanding problem for those who want to draw on social media as an information source.

Thankfully, however, just as information is being crowdsourced, verification is becoming a collective endeavour. Irish politics recently saw the confusion and embarrassment that can be caused by a social media hoax, as well as the power of crowdsourced debunking to expose the hoaxers.

Paul Murphy TD, of the Socialist Party

The weekend of the 11th and 12th October saw Paul Murphy of the Socialist Party win a closely-fought by-election in the Dublin South-West constituency. Murphy beat the favourite, Sinn Fein’s Cathal King, to win the seat, focusing on water charges as the core political issue.

Although Sinn Fein has emerged as the leading opposition party in recent months, it was challenged on its left flank by the Socialist candidate. Sinn Fein has successfully positioned itself as the primary opposition to the Government, but, as analysts have argued, left itself open to criticism from the left in doing so. While the parties to the left of Sinn Fein are significantly smaller, they have achieved limited success by advocating strong resistance to unpopular policies, such as the incoming water taxes.

Murphy’s election campaign focused heavy criticism on Sinn Fein for what he sees as half-hearted opposition to water taxes, a big issue in the working class district. Sinn Fein have promised to abolish the charges if they are elected to Government, but have not committed to supporting the non-payment strategy that the Socialist Party champions. Sinn Fein representatives have angrily rejected this criticism and accused the Socialist Party of being dishonest.

It was therefore with great delight that prominent Sinn Fein representatives, such as Deputy Leader Mary Lou McDonald, shared an image that showed Murphy admitting that his criticisms were dishonest and that he was just playing politics.

The fake Paul Murphy conversation

But others found the image less easy to believe. Many questioned its authenticity and called for more information. Paul Murphy quickly used Facebook to disclaim the conversation, saying:

“A fake screengrab of a conversation that I supposedly had with a SFer is being circulated at the moment by prominent SF members, including Lynn Boylan.

It is a fake, appears to be a photoshop. Extremely low dirty tricks that should be withdrawn immediately. Please share widely.”

Some quick searches by interested social media users found that the source of the image, Jason Roe, was almost certainly a Sinn Fein member or supporter; his Cover Image on Facebook featured a mural of party leader Gerry Adams, as well as comments complaining about Socialist Party leaflets. Amidst the searches, it was also clear that he was in the middle of deactivating or changing the privacy settings on his Facebook as the account became temporarily, and later permanently, unavailable.

There was no response to requests for the original image to be made available so it could be checked to see if it had been altered (Facebook removes information from images that can be useful for finding evidence of manipulation in programs like Photoshop). Most tellingly of all, it was clear that the ‘conversation’ did not come from either Paul Murphy’s personal or political account, as it featured the name ‘Paul Murphy’ of his personal account and the profile image of his political account, ‘Paul Murphy, Socialist Party’.

When curious members of the public identified these issues, Murphy’s side drew on their work and took the fight to Sinn Fein representatives, demanding that they admit the mistake. Eventually the representatives deleted the photo and comment threads from their Facebook profiles. Jason Roe, the source of the hoax, responded to Paul Murphy, admitting that the conversation was fake and claiming that it came from a fake profile of Murphy. He further claimed that he had deleted the image and the conversation, and that the fake profile had disappeared.

The eventual admission of the hoax by the source

Regardless of whether it was Photoshopped or a hoax account as claimed above, it was clearly a shoddy smear attempt. By the evening, the party representatives who had shared the image apologised and Murphy was given plenty of chances to recapitulate his criticisms of Sinn Fein on air and in print.

In this case, the hoax was exposed before it could even get to the mass media, but journalists would do well not to count on a social media ‘self-correction’ effect occurring so rapidly. As this example shows, information coming from social media has to be vetted carefully, looking at both the information content and the originating source. In this case, attention to both of these aspects led to the screenshot’s rapid discrediting and embarrassment for those drawn in by it. In general, it’s worth remembering that if something seems too good to be true, it might well be false.

Twitter as an Open Source Newsroom: An Interview with Andy Carvin

Last September Fergal Gallagher was able to interview Andy Carvin just before he gave his talk at the Truth in News Symposium held at Dublin City University. We have transcribed the audio from his interview and the original recording is available at the bottom of this page.

Fergal Gallagher: I am here with Andy Carvin, Senior Strategist at NPR, National Public Radio. (NB: Andy has since moved to First Look Media.) We are at the DCU media conference. Andy hasn’t spoken yet but his name has already come up quite a lot for the way he uses social media to cover stories. Maybe you could talk about what you do.

Andy Carvin: My primary job at NPR, essentially, is to experiment with new ways of reporting. Especially when it comes to collaborative reporting with the public…Over the years I have done a variety of experiments on Twitter… but during the Arab Spring things really came together because we reached a critical mass of people who were using social media in certain parts of the world and were willing to serve as eye-witnesses to these events.

…I use my Twitter account, essentially as an open source newsroom. Rather than being a newswire service where I am constantly saying, “This has happened. This hasn’t happened.” blah, blah, blah, blah, I am asking people, “What do you know about this? Have you heard about this? Was there actually a chemical attack yesterday? Are there any videos yet? How do we know those videos aren’t from somewhere else?”

And so my Twitter followers will work with me to investigate certain things that we think are interesting. Often, the stuff we do are ideas they come up with. So, for example, during the Arab Spring a number of news outlets in the region started to report that Israel was supplying weapons to Muammar Gaddafi which seemed a little insane at the time given that they were arch-enemies. But, nonetheless, it was being reported.

They were reporting it because they found a mortar shell that had what looked like a Star of David on it, a six pointed star. Well, in less than an hour my Twitter followers were able to prove that’s a standard symbol that has been used for over a hundred years to mark these types of shells as star shells that you shoot up to light up the sky. They are illumination rounds so had nothing to do with Israel. The same symbol could be found on mortars and artillery from World War 1. So, it is things like that, we stumble on these questions that are being reported. One question leads to another and we just dig in further.

FG: On that point there is a guy in the UK who does something similar on Syria.

AC: He goes by the name of Brown Moses but his real name is Eliot Higgins. He is a fascinating guy because he was just some unemployed bloke who had a lot of free time on his hands. So he started paying attention to Syria and he has become the civilian expert on arms in and out of Syria. It’s incredible what he’s done. So, I’ve worked with him on a few occasions and we swap information.

FG: It’s basically a new way of gathering news that traditionally was done behind closed doors whereas you are making, as you say, an open, public newsroom. I know you have had some criticism from traditional journalists who say you tweet something or ask someone, “Is this true?” And it later turns out to be false, whether it’s someone else or your followers who have proved it to be false. What do you say to people who say, “You shouldn’t be tweeting that if you are not sure if it is true,” in case people don’t see the correction?

AC: Well, I got that a lot after the Newtown massacre, at that elementary school in Connecticut. Michael Wolff wrote a fairly scathing editorial about me for The Guardian but I wrote back and paragraph by paragraph I told him he is completely misunderstanding what I do. In fact, in the examples he cited of where I was sharing rumours, I was actually sending out tweets reporting what U.S. broadcasters were claiming on air and asking what evidence people knew about them. So, at one point there was a report that a purple van had been surrounded by police and I hadn’t seen that reported anywhere else but one of the U.S. networks cited it. So I asked people, “What do you know about this?” We tried to figure it out.

In the context of how I work it makes sense for the people that follow me. And the folks that have criticised me, they understand that but don’t feel comfortable with it. I can’t force them to change their outlook on how journalism and reporting happens. The only way I can judge it is if the news organisation I am working for is happy with how I am doing it and if our ombudsman has any problem with how I am doing it. In each case the response has been very positive.

FG: Since you have been doing this there have been many more journalists, and non-journalists, who have been copying that technique.

AC: It is becoming more common in different ways. People may not do it full time for their reporting but there are times when the public knows more than you do, so why not ask them for help? I certainly wouldn’t want to pretend that this is the way we should be reporting in all circumstances because probably the majority of the time you are reporting on other stories it may not be appropriate.

FG: I guess a key thing with the Arab Spring was that you didn’t have foreign correspondents on the ground.

AC: The two things that really came into play there were, first, that in many cases you didn’t have foreign correspondents on the ground, and that was especially true for Libya early on and for a lot of Syria. But on the other hand you had people living there who were willing to capture what was happening and upload it or share it in some form.

If you can find a critical mass of these people and cross-reference what they’re saying, somewhere in the middle of that is the truth. So, if there is a large protest and, let’s say, shooting starts to happen, one person may report it on Twitter but they are going to have a very limited field of view around them and they may not totally understand what had just happened. Whereas if I have been able to identify thirty people across that same area, spread out all over, and then monitor what they are saying simultaneously you can, in some way, triangulate the truth from that.

FG: News budgets are more and more limited so do you think your technique could be a future for news? Instead of sending someone out who is expensive you do this kind of reporting.

AC: I really hope not because it is a completely different style of reporting. Some of the most powerful reporting that has come out of Syria has been when reporters have gone in and been able to look people directly in the eye to talk about relatives that have died or have been wounded and to be able to put together the context of that in a broader sense. And that can be very hard to do remotely. I would hate to see my methods used as an excuse to cut back on foreign reporting or any type of reporting for that matter. I think they complement each other very well.

At one point, I ended up going to Cairo for an event. There was a big altercation in Tahrir Square that evening that came out of nowhere. They hadn’t had one for several months at that point.

So, I went with a small group of protestors and they brought me as close as they could get without bringing me into the square as they were a little paranoid about my safety. We are surrounded by police. There is tear-gas going off and people with blood running down their hair. So, I am able to observe all these things going on in my immediate vicinity but I still really didn’t have an understanding of what was going on. I didn’t know what was happening at the centre of Tahrir or on the other side of it. And the only way I was personally able to fill in those pieces was when I was able to get a signal on my phone again and could look on Twitter and see what all my contacts were saying.

So I end up becoming one node out of many that paints a bigger picture. But at the same time, one correspondent at that location serving as that single node can still tell extremely vivid and powerful stories.

FG: It’s funny that although you were there you were still doing journalism as if you were sitting at home in the U.S.

AC: It is funny there are times that I am traveling and there is something happening in that city and I almost would rather be back at my hotel working on it because there is only so much I can do caught in the middle of it.

FG: Where do you see things going in the immediate future? Will there still be print in five or ten years?

AC: It is phasing out in different places at different speeds. I am never comfortable making a prediction with these things. Many newspapers in the U.S. have either switched entirely to digital or have gone out of business or have cut back the amount they are producing on paper. So, the trend isn’t favourable for them but at the same time they are beginning to do things online that are profitable.

I’ve never been a fan of pay-walls myself but the New York Times is making a profit on their pay-wall. So, there are different economic models.

I think the key point here is that journalism isn’t dying. It’s the economic models that are changing and in that process some of those economic models are going to fade away and they just happen to be ones that we cling onto very closely because they have worked so well for a very long time. And in some cases we haven’t planned for that transition very well. But I think there are some papers that will continue to do what they are doing because they have an audience that really loves what they do.

As long as the subscribers are willing to pay and advertisers are willing to pay to fill a demographic we’ll have a New York Times and an Economist and others. Whatever format it is in people will use it because they value the content. So the key thing is differentiating yourself with the content.

This article was originally published at Technology Voice.

Banner and top picture By Ahmed Abd El-Fatah from Egypt [CC-BY-2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons.

Main body picture By Lilian Wagdy (DSC_9315 Uploaded by The Egyptian Liberal) [CC-BY-2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons.

HuJo wins prize at BBC #newsHack II

A team from the Digital Humanities and Journalism group at Insight @NUIG won the ‘Connecting the News’ category prize at the BBC NewsHack on May 1st-2nd. The team developed Hash2News, a Chrome Extension which enables users to find the news stories behind Twitter hashtags.

The #newsHack is an initiative of the BBC NewsLabs innovation programme, organised by BBC Connected Studio and the Global Editors Network (sponsored by Google), and aims to foster digital innovation in news. HuJo-Insight competed alongside teams from other academic institutes as well as news organisations such as Sky News, the Financial Times, Storyful and the BBC.

The Hackathon saw some great products – from Storyful’s Chrome extension to help journalists collate cross-media news stories, to the visually beautiful BBC Brief, which optimises news articles for mobile devices by displaying them in minute, screen-sized pieces.

Team HuJo working hard

Inspired by the belief that a hack should identify and solve a particular problem, the HuJo-Insight team decided to use their expertise with handling Twitter streams and entity extraction to find the news articles most relevant for any given hashtag. Social media, especially Twitter, presents a large stream of discussion to users, often informed by external news events. The result is that users often feel like they’re ‘out of the loop’, and want to find out what is behind ongoing social media discussions. By providing a direct link from Twitter content to relevant news articles, HuJo’s Chrome extension enables Twitter users to find ‘the news behind the noise’; the news articles relevant to social media conversations.

 

Extension inserts a search icon next to each hashtag in the Twitter feed

The extension works by using Twitter’s Streaming API to extract named entities from hashtag streams, then matching them with BBC news articles via the BBC NewsLabs semantic API, which makes queries against semantically tagged news articles, where the tags indicate what each article is about.

Hash2News Architecture
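
The flow described above (hashtag stream, entity extraction, matching against tagged articles) can be illustrated with a minimal Python sketch. It is not the team's implementation: it calls neither the Twitter Streaming API nor the BBC NewsLabs semantic API, the tweets and articles are hypothetical sample data, the regular expression is a crude stand-in for real named-entity extraction, and ranking articles by summed entity counts is an assumption made for illustration only.

```python
import re
from collections import Counter

# Hypothetical sample tweets for one hashtag; a real system would read a live stream.
TWEETS = [
    "Huge crowds in Dublin today as Paul Murphy speaks #watercharges",
    "Paul Murphy says Irish Water must go #watercharges",
    "Protest in Dublin against Irish Water #watercharges",
    "Irish Water meters blocked again #watercharges",
]

# Hypothetical articles with semantic tags, standing in for a tagged news index.
ARTICLES = [
    {"title": "Thousands march in Dublin over water charges",
     "tags": {"Dublin", "Irish Water"}},
    {"title": "Paul Murphy wins by-election",
     "tags": {"Paul Murphy", "Dublin"}},
    {"title": "Storm warning for Scotland",
     "tags": {"Scotland"}},
]


def extract_entities(text):
    """Naive stand-in for entity extraction: runs of capitalised words."""
    text = re.sub(r"#\w+", "", text)  # drop hashtags before matching
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)


def top_entities(tweets, min_count=2):
    """Count how many tweets mention each entity; keep the recurring ones."""
    counts = Counter()
    for tweet in tweets:
        counts.update(set(extract_entities(tweet)))  # one vote per tweet
    return {entity: n for entity, n in counts.items() if n >= min_count}


def rank_articles(entity_counts, articles):
    """Score each article by the summed counts of the entities it is tagged with."""
    scored = []
    for article in articles:
        score = sum(entity_counts.get(tag, 0) for tag in article["tags"])
        if score > 0:
            scored.append((article["title"], score))
    return sorted(scored, key=lambda pair: -pair[1])


if __name__ == "__main__":
    entities = top_entities(TWEETS)
    for title, score in rank_articles(entities, ARTICLES):
        print(score, title)
```

The `min_count` threshold drops one-off capitalised words such as sentence openers, which the naive pattern would otherwise treat as entities; a proper entity extractor would make that filter less important.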

All this happens under the hood, however; any user of the HuJo Chrome extension will simply see a search icon placed next to each hashtag on their Twitter stream; clicking on this redirects the user to a custom page that displays and links to relevant news articles.

Clicking a search icon returns news articles related to that hashtag.

The judges saw the utility of this product and awarded the team the prize for ‘Connecting the News’, which requires the team to “pique audience interests, to tap into social media habits, and support consumption across devices.” The other category winners were BBC Location Service (Explaining the news), The Independent (Tools for Journalists), Sky News (Theming the News), University of the West of Scotland (NewsCrack award) and BBC archives (Visually Inspired). The Best in Show winners were The Financial Times (Glasgow) and The Times/Sunday Times (Dublin).

Ravi and Dara presenting Hash2News

The HuJo team plan to finalise their extension and make it freely available online.

Here you can see a curated Storify of the event by one of our team members, Ravindra. You can also follow the hackathon feed here.

Citizen Journalism and Social Media Archiving minitrack at HICSS 48

We, in collaboration with the Digital Repository of Ireland, are organising a minitrack on Citizen Journalism and Social Media Archiving as part of the Digital and Social Media track of HICSS 48.

Call for papers:

HICSS Citizen Journalism and Social Media Archiving minitrack
http://www.hicss.hawaii.edu/HICSS_48/Tracks/DSM/DSMCitizen.pdf

Hawaii International Conference on System Sciences (HICSS) 48
January 5-8, 2015, Grand Hyatt, Kauai, Hawaii

PAPERS DUE: June 15, 2014 via the HICSS conference system
http://www.hicss.hawaii.edu/hicss_48/apahome48.htm

 Hashtag: #hicss_sja

The exponential growth of social media as a central communication practice, and its agility in announcing breaking news events more rapidly than traditional media, has changed the journalistic landscape: social media has been adopted as a significant source by professional journalists, and conversely, citizens are able to use social media as a form of direct reportage. Social media content now forms a significant part of the digital content generated every day, and provides a platform for voices that would not reach the broader public through traditional journalistic media alone. In this emerging environment, citizen microblogs and other user-generated content constitute an important part of history and popular memory, in particular when attempting to capture significant events and the varied perspectives that accompany these events.

The flow of citizen generated reporting through social media is ephemeral and disordered; it quickly becomes inaccessible if not captured and stored in some way. This is in contrast to traditional journalism, which has well-developed archival practices, enabling the news production life cycle to reuse and rediscover content. This new landscape calls for technologies and methodologies to 1) rapidly and efficiently capture, filter and verify content in a way that generates immediate value for journalistic purposes; and 2) properly annotate and archive this information for longer-term preservation, access and reuse in the news life-cycle (e.g. for contextualisation, investigative reporting or comprehensive storytelling). If not preserved, or if preserved without careful attention to subsequent access and discoverability, there is a risk of losing the diversity this rich social narrative contributes to traditional news media.

In this minitrack we are interested in addressing a variety of research questions from both theoretical and pragmatic perspectives. For example, how can we best utilise social media for news production? What technologies can we use for breaking news detection, filtering, aggregation and contextualisation? How can we assess the veracity of social media content and sources? What moral, legal, and ethical issues arise when professional journalists use social media as a source? How can we organise, interpret, and retain a record of social media around news events? What does this record contribute to our larger understanding of news, and the writing of news? How can rigorous archiving and preservation of social media help researchers and journalists in their work on social movements, citizen engagement, political events, and network formation?

In summary, this multidisciplinary minitrack focuses on the areas of citizen/social journalism and social media archiving, which pose distinct, yet complementary, research challenges. By pairing these topics in one multidisciplinary minitrack we hope to stimulate an exchange of ideas between multiple domains of research and industry, including news media, digital archiving and preservation, social network analysis, semantic web and linked data, communication studies and cultural studies. To this end we welcome papers in either of these two areas or papers that address their intersection.

Keeping the application domains of the minitrack in mind, topics of interest include, but are not limited to:

  • The role of user generated content in the news production life-cycle
  • Veracity, trust and provenance of social media sources and content
  • Approaches to archiving social media content
  • Event and topic detection and clustering
  • Information and entity extraction
  • Social network and community analysis
  • Semantic Web and Linked Data technologies for archival, discovery and enrichment of social media content
  • Opinion mining and sentiment analysis
  • Story curation, contextualisation and recommendation
  • Ethical challenges in archiving and broadcasting social media content

For more information please visit the HICSS minitrack web page at http://www.hicss.hawaii.edu/HICSS_48/Tracks/DSM/DSMCitizen.pdf.

ORGANISERS

Bahareh Heravi, Insight Centre for Data Analytics at NUI Galway (formerly known as DERI)
Email: Bahareh.Heravi@deri.org  (Primary Contact)
Twitter: @Bahareh360

Natalie Harrower, Digital Repository of Ireland
Email: n.harrower@ria.ie
Twitter: @natalieharrower

Stefan Decker, Insight Centre for Data Analytics at NUI Galway (formerly known as DERI)
Email: Stefan.Decker@deri.org
Twitter: @Stefanjdecker

ABOUT HICSS CONFERENCES

HICSS conferences are devoted to the most relevant advances in the information, computer, and system sciences, and encompass developments in both theory and practice. Accepted papers may be theoretical, conceptual, tutorial or descriptive in nature. Those selected for presentation are included in the Conference Proceedings published by the IEEE Computer Society and maintained in the IEEE Digital Library.

How to Submit a Paper:  Follow Author Instructions on the conference web site:  http://www.hicss.hawaii.edu/hicss_48/apahome48.htm

HICSS papers must contain original material.  They may not have been previously published, nor currently submitted elsewhere.  All submissions undergo a double-blind peer review process. Abstracts are optional, but recommended. You may contact the Minitrack Chair(s) for guidance or verification of content.

Submit a paper to only one Minitrack.  If a paper is submitted to more than one minitrack, then either paper may be rejected by either minitrack without consultation with author or other chairs. If you are not sure of the appropriate Minitrack, submit an abstract to the Track Chair(s) for determination, and/or seek informal opinion(s) of Minitrack Chair(s) before submitting.

Do not author or co-author more than 5 papers.  This means that an individual may be listed as author or co-author on no more than 5 submitted papers.  Track Chairs must approve any names added after submission or acceptance on August 15.

IMPORTANT DEADLINES FOR AUTHORS FOR HICSS 48

June 15, 2014   –   SUBMIT FULL MANUSCRIPTS FOR REVIEW as instructed. The review is double-blind; therefore, this initial submission must be without author names.

Aug 15, 2014   – Review System emails Acceptance Notices to authors. It is very important that at least one author of each accepted paper attend the conference. Therefore, all travel guarantees – including visa or your organization’s fiscal funding procedures – should begin immediately.  Make sure your server accepts the review system address https://precisionconference.com/~hicss.

Sep 15, 2014  – SUBMIT FINAL PAPER.  Add author names to your paper, and submit your Final Paper for Publication to the site provided in your Acceptance Notice.  (This URL is not public knowledge.)

Oct 1, 2014   – EARLY REGISTRATION FEE DEADLINE. At least one author of each paper should register by this date in order to secure publication in the Proceedings.  Fees will increase on Oct 2 and Dec 2.

Oct 15, 2014   – Papers without at least one paid-in-full registered author may be deleted from the Proceedings and not scheduled for presentation; authors will be so notified by the Conference Office.