Microposts are a highly popular medium for sharing facts, opinions or emotions. They are an invaluable wealth of data, ready to be mined for training predictive models. This year the #Microposts2014 Workshop will host a "Named Entity Extraction and Linking Challenge (NEEL)".
The overall task of the challenge is to automatically extract entities from English microposts and link them to the corresponding English DBpedia v3.9 resources (if the linkage exists). In the linking stage we aim to disambiguate expressions that are formed by discrete (and typically short) sequences of words.
Existing entity linking tools are intended for use over news corpora and similar document-based corpora made up of relatively long texts. We organise this challenge to foster research into novel, more accurate solutions for automatic entity linking in (much shorter) micropost data.
We will ask the participants to automatically extract entities (e.g., Obama, London, Rakuten) belonging to all entity types (e.g., Person, Location, Organisation) from a collection of microposts. Participants will have to automatically provide context-relevant DBpedia resources for each entity in a micropost.
The dataset comprises 3.5K tweets extracted from a much larger collection of over 18 million tweets. This collection, provided by the Redites project, covers event-annotated tweets collected for the period from 15th July 2011 to 15th August 2011 (31 days). It extends over multiple noteworthy events including the death of Amy Winehouse, the London Riots and the Oslo bombing. Since the task of this challenge is to automatically extract and link entities, we have built our dataset considering both event and non-event tweets. While event tweets are more likely to contain entities, non-event tweets enable us to evaluate how well a system avoids false positives in the entity extraction phase.
The dataset has been split into a training set (70%) and a test set (30%). Following the Twitter TOS, we will only provide tweet IDs and annotations for the training set, and tweet IDs for the test set. We will also provide a common framework to mine these datasets from Twitter.
The training set will be released as a TSV file where each line consists of:
Tokens are separated by TABs. Entity mentions and URIs are listed in their order of appearance in the tweet.
We will announce the release of the data sets promptly on the workshop mailing list. Please subscribe to the #Microposts2014 Google group.
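As a rough illustration of reading such a file, the sketch below splits one TAB-separated line into its leading fields and an ordered list of (mention, URI) pairs. The exact column layout is not spelled out here, so the code assumes that a configurable number of leading fields (by default a single tweet ID) is followed by alternating mention and URI fields. The names `AnnotatedTweet`, `parse_line` and `n_leading` are hypothetical and not part of the challenge specification.

```python
from typing import List, NamedTuple, Tuple


class AnnotatedTweet(NamedTuple):
    leading: List[str]               # fields before the annotations (assumed: tweet ID first)
    entities: List[Tuple[str, str]]  # (mention, DBpedia URI) pairs, in order of appearance


def parse_line(line: str, n_leading: int = 1) -> AnnotatedTweet:
    """Split one TAB-separated line into leading fields and (mention, URI) pairs.

    n_leading is an assumption about the layout, not part of the challenge
    specification; adjust it to match the released file.
    """
    fields = line.rstrip("\r\n").split("\t")
    leading, rest = fields[:n_leading], fields[n_leading:]
    if len(rest) % 2 != 0:
        # Every mention is expected to be followed by exactly one URI.
        raise ValueError(f"unpaired mention/URI field in line: {line!r}")
    pairs = list(zip(rest[0::2], rest[1::2]))
    return AnnotatedTweet(leading, pairs)
```

A tweet without any annotation yields an empty `entities` list, which keeps non-event tweets in the same data structure as annotated ones.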
Download the data
Evaluation of accuracy
The evaluation consists of two separate stages:
- Paper peer review: a community of domain experts will judge the quality and applicability of the approaches taken, and provide useful insights on your research;
- Precision and recall: F1 (F-measure with beta = 1) will be computed against a gold standard manually created from the test set. Both the automatically extracted entities and the links will be matched against this ground truth.
Submissions will be ranked solely according to F1, using each participant's best submission.
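For participants who want to estimate their score on a held-out part of the training set, the following is a minimal sketch of precision, recall and F1 over sets of annotations. It assumes that an annotation counts as correct only when the tweet ID, the mention and the DBpedia URI all match exactly; the matching rules of the official scorer may differ, and the names `Annotation` and `precision_recall_f1` are hypothetical.

```python
from typing import Iterable, Tuple

# Assumed matching unit: (tweet ID, mention, DBpedia URI), compared exactly.
Annotation = Tuple[str, str, str]


def precision_recall_f1(
    predicted: Iterable[Annotation], gold: Iterable[Annotation]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted annotations against gold ones."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall (F-measure with beta = 1).
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because F1 is a harmonic mean, a system that extracts many spurious entities from non-event tweets is penalised through precision even if its recall is high.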
Submissions should be provided as a zip file using your system name as the file name (e.g. 'awesome.zip'), containing:
- a TSV file with your system name (e.g. 'awesome.tsv'). We accept up to 3 different submissions, and we will consider *only* the best. If you submit more than one, you must clearly specify in your paper the modifications applied to each labelled submission. In this case the archive should contain up to 3 TSV files, each named with the tool/system name with "_n" appended (e.g. awesome_1.tsv, awesome_2.tsv, awesome_3.tsv). To allow us to evaluate your submissions, each TSV file must follow the format in which the training set is provided.
- a paper of 2 pages describing your approach and how you tuned/tested it using the training split. All submissions must be in English and in PDF, formatted in the ACM SIG Proceedings format. Submissions are not anonymous. Please send us your submission before the deadline through Easychair. All accepted submissions will be invited for short presentations during the #Microposts2014 workshop and will be published independently from the workshop proceedings on the challenge page and on CEUR (note that a minimum number of papers must be submitted for publication on CEUR to be possible).
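The naming convention for the archive and its TSV files can be sketched as a small helper. This is only an illustration of the rules stated above: the function names are hypothetical, and the sketch assumes that a single run uses the plain system name while multiple runs use the "_n" suffix.

```python
from typing import List


def archive_name(system_name: str) -> str:
    """Name of the zip file, e.g. 'awesome.zip' for the system 'awesome'."""
    return f"{system_name}.zip"


def submission_file_names(system_name: str, n_runs: int = 1) -> List[str]:
    """Expected TSV file names for a submission with 1 to 3 runs."""
    if not 1 <= n_runs <= 3:
        raise ValueError("between 1 and 3 runs are accepted")
    if n_runs == 1:
        # Assumption: a single run carries no numeric suffix.
        return [f"{system_name}.tsv"]
    return [f"{system_name}_{i}.tsv" for i in range(1, n_runs + 1)]
```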
- Intent to participate: 13 Jan 2014 (soft), revised to 16 Jan 2014
- Release of training set: 14 Jan 2014, revised to 17 Jan 2014
- Release of test set: 17 Feb 2014, revised to 21 Feb 2014
- Challenge Submission deadline: 21 Feb 2014, revised to 26 Feb 2014
- Challenge Notification: 14 Mar 2014 (hard)
- Challenge camera-ready deadline: 24 Mar 2014 (hard)
- Workshop program issued: 15 Mar 2014
- Challenge proceedings to be published via CEUR
- Workshop: 07 Apr 2014 (Registration open to all)
(All deadlines 23:59 Hawaii Time)
A. Elizabeth Cano, Aston University, UK
Giuseppe Rizzo, Università di Torino, Italy
Challenge Dataset Chair
Andrea Varga, The University of Sheffield, UK
Evaluation Committee
Ebrahim Bagheri, Ryerson University, Canada
Pierpaolo Basile, University of Bari, Italy
Óscar Corcho, Universidad Politécnica de Madrid, Spain
Leon Derczynski, The University of Sheffield, UK
Guillaume Erétéo, INRIA, France
Miriam Fernandez, The Open University, UK
Andrés García-Silva, Universidad Politécnica de Madrid, Spain
Anna Lisa Gentile, The University of Sheffield, UK
Robert Jäschke, University of Kassel, Germany
Diana Maynard, The University of Sheffield, UK
José M. Morales del Castillo, El Colegio de México, Mexico
Georgios Paltoglou, University of Wolverhampton, UK
Bernardo Pereira Nunes, PUC-Rio, Brazil
Daniel Preoţiuc-Pietro, The University of Sheffield, UK
Irina Temnikova, Bulgarian Academy of Sciences, Bulgaria
Victoria Uren, Aston University, UK
Contact workshop organisers or challenge chair