Data scientists often use crowdsourcing platforms, such as Amazon Mechanical Turk or CrowdFlower, to collect labels for their data. Ensuring high quality and timely execution of tasks is an important part of such a collection process. It is not possible (or not efficient) to manually check every worker assignment. There is an intuition that the quality of the work could be predicted from the worker's behaviour in the task browser (e.g. key presses, scrolling, mouse clicks, tab switching). This dataset contains assignment results for 3 different crowdsourcing tasks launched on CrowdFlower, along with the associated worker behaviour.
We collected the data by running 3 tasks:
- Image labelling,
- Receipt Transcription,
- Business Search.
![Task User Interface][1]
Tasks are described in tasks.csv.
Results for the corresponding tasks are given in the files results_{task_id}.csv. Workers' activity can be found in the following files:
* activity_keyboard.csv - timestamps of keyboard keys pressed
* activity_mouse.csv - timestamps of mouse clicks with associated HTML elements
* activity_tab.csv - timestamps of task browser tab events (opened, active, hidden, closed)
* activity_page.csv - a summary of the events that happened on the task page, recorded every 2 seconds (boolean keyboard activity, boolean mouse movement activity, boolean scrolling activity, the position of the screen, boolean indicating whether text was selected)
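
To make the idea of a 2-second activity summary concrete, here is a minimal sketch of how raw event timestamps (such as key presses) could be collapsed into one boolean per 2-second window. How activity_page.csv was actually produced is not documented here, so the function name `activity_flags`, its parameters, and the use of seconds as the time unit are all assumptions made for illustration.

```python
import math


def activity_flags(event_times, start, end, window=2.0):
    """Return one boolean per window of `window` seconds between start and end.

    A flag is True if at least one event timestamp falls inside that window.
    All times are plain numbers in seconds (an assumption for this sketch).
    """
    n_windows = max(0, math.ceil((end - start) / window))
    flags = [False] * n_windows
    for t in event_times:
        if start <= t < end:
            # Index of the window that contains this event.
            flags[int((t - start) // window)] = True
    return flags


# Hypothetical key-press times, not taken from the dataset.
flags = activity_flags([0.5, 1.9, 4.1], start=0, end=8)
```

Each entry in `flags` corresponds to one 2-second slice of the time spent on the task page, which is the same granularity as the rows described for activity_page.csv.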
The result files have a structure similar to the original one provided by CrowdFlower:
* _unit_id: A unique ID number created by the system for each row
* _created_at: The time the contributor submitted the judgment
* _golden: This will be "true" if this is a test question, otherwise it is "false"
* _id: A unique ID number generated for this specific judgment
* _missed: This will be "true" if the row is an incorrect judgment on a test question.
* _started_at: The time at which the contributor started working on the judgment
* _tainted: This will be "true" if the contributor has been flagged for falling below the required accuracy. This judgment will not be used in the aggregation.
* _channel: The work channel that the contributor accessed the job through
* _trust: The contributor's accuracy
* _worker_id: A unique ID number assigned to the contributor (in this dataset, an MD5 hash of the value is given)
* _country: The country the contributor is from
* _region: A region code for the area the contributor is from
* _city: The city the contributor is from
* _ip: The IP address of the contributor (in this dataset, an MD5 hash of the value is given)
* {{field}}: There will be a column for each field in the job, with a header equal to the field's name.
* {{field}}_gold: The correct answer for the test question
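
As an illustration of how these columns might be combined, the sketch below computes, per worker, the mean time between `_started_at` and `_created_at` and the share of test questions answered correctly (using `_golden` and `_missed`). It is a minimal example under stated assumptions, not part of the dataset tooling: the timestamp format in `TIME_FORMAT`, the function name `summarize_workers`, and the sample rows are hypothetical, and the real files may need a different format string.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Assumption: timestamps look like "3/14/2017 10:15:00"; adjust to the real files.
TIME_FORMAT = "%m/%d/%Y %H:%M:%S"


def summarize_workers(csv_file, time_format=TIME_FORMAT):
    """Return {worker_id: {"mean_seconds": float, "gold_accuracy": float or None}}."""
    durations = defaultdict(list)
    gold_total = defaultdict(int)
    gold_missed = defaultdict(int)
    for row in csv.DictReader(csv_file):
        worker = row["_worker_id"]
        started = datetime.strptime(row["_started_at"], time_format)
        created = datetime.strptime(row["_created_at"], time_format)
        durations[worker].append((created - started).total_seconds())
        # Only test questions (_golden == "true") count towards accuracy.
        if row["_golden"].strip().lower() == "true":
            gold_total[worker] += 1
            if row["_missed"].strip().lower() == "true":
                gold_missed[worker] += 1
    summary = {}
    for worker, secs in durations.items():
        total = gold_total[worker]
        summary[worker] = {
            "mean_seconds": sum(secs) / len(secs),
            # None when the worker saw no test questions.
            "gold_accuracy": (total - gold_missed[worker]) / total if total else None,
        }
    return summary


# Made-up rows for illustration only; they are not taken from the dataset.
SAMPLE = """_unit_id,_created_at,_golden,_id,_missed,_started_at,_worker_id
1,3/14/2017 10:01:00,true,11,false,3/14/2017 10:00:00,w1
2,3/14/2017 10:04:00,true,12,true,3/14/2017 10:01:00,w1
3,3/14/2017 10:00:30,false,13,,3/14/2017 10:00:00,w2
"""

summary = summarize_workers(io.StringIO(SAMPLE))
```

To run this on a real file, pass an open file handle instead of the in-memory sample, for example `open("results_{task_id}.csv", newline="")` with the task ID filled in. Since `_worker_id` is an MD5 value in this dataset, the keys of `summary` would be those hashes.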
We thank the crowd workers who completed our not-always-exciting tasks on CrowdFlower.
[1]: http://bit.ly/2wWtuDe
