README.md
Context:
It can be difficult to identify the language that a tweet is written in. Tweets are very short, and they often include code-switching, where the user mixes two or more languages, or names borrowed from a different language.
This dataset contains tweets in a variety of languages, each tagged for whether it is in English, whether it contains code-switching, whether it includes names from a different language, and whether it was generated automatically.
Content:
This dataset contains 10,502 tweets, randomly sampled from all publicly available geotagged Twitter messages. Each tweet is annotated as English or non-English, and for code-switching, language ambiguity, and automatic generation. The dataset includes messages sent from 130 different countries.
The file all_annotated.tsv contains the dataset of 10,502 tweets used in the paper. Text is encoded as UTF-8.
The column headings (also given in the .tsv file) are: tweet ID, ISO country code, tweet date, tweet text, definitely English, ambiguous, definitely not English, code-switched, ambiguous due to named entities, and automatically generated tweets.
All annotations are binary; the definitely English, ambiguous, and definitely not English annotations are mutually exclusive.
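The column layout and the mutual-exclusivity rule above can be checked with a short script. This is a minimal sketch, not the authors' tooling. It addresses fields by position, in the order listed above, because the exact header strings in the file are not reproduced here; the short field names in `FIELDS` are made up for illustration. It also assumes that binary annotations are stored as `0`/`1` and that tweet text is not quoted, neither of which this description confirms.

```python
import csv
import io

# Field names are hypothetical shorthand; the order follows the column
# list in this README. The real header strings may differ.
FIELDS = [
    "tweet_id", "country", "date", "text",
    "definitely_english", "ambiguous", "definitely_not_english",
    "code_switched", "ambiguous_named_entities", "automatically_generated",
]
LABELS = FIELDS[4:]        # the six binary annotation columns
EXCLUSIVE = FIELDS[4:7]    # the three mutually exclusive annotations


def read_annotations(handle, has_header=True):
    """Parse tab-separated rows into dicts with integer label values."""
    # QUOTE_NONE is an assumption: tweet text may contain stray quote marks.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the heading row
    rows = []
    for raw in reader:
        if len(raw) != len(FIELDS):
            continue  # skip rows that do not have exactly ten fields
        row = dict(zip(FIELDS, raw))
        for name in LABELS:
            row[name] = int(row[name])  # assumes "0"/"1" encoding
        rows.append(row)
    return rows


def exclusivity_violations(rows):
    """Return rows where more than one exclusive annotation is set."""
    return [r for r in rows if sum(r[name] for name in EXCLUSIVE) > 1]
```

To run it against the real file, pass an open handle, for example `open("all_annotated.tsv", encoding="utf-8", newline="")`, to `read_annotations` and confirm that `exclusivity_violations` returns an empty list.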
Acknowledgements:
This dataset was collected by Su Lin Blodgett, Johnny Tian-Zheng Wei and Brendan O'Connor. It is redistributed here under the [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/). If you use this data in your work, please cite the following paper:
Blodgett, Su Lin, Johnny Wei, and Brendan O'Connor. "[A Dataset and Classifier for Recognizing Social Media English](http://www.aclweb.org/anthology/W17-4408)." Proceedings of the 3rd Workshop on Noisy User-generated Text. 2017.
You can find more information on this dataset and related work on [this website](http://slanglab.cs.umass.edu/TwitterLangID/).
Inspiration:
* Can you use this dataset to build a classifier that identifies whether a tweet is in English or not?
* Can you use this dataset to build a language identifier? (You can check out the authors’ language identifier [here](http://slanglab.cs.umass.edu/TwitterLangID/).)
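As a starting point for the first question, here is a minimal sketch of a character-trigram naive Bayes classifier. It is a generic baseline chosen for illustration and is not the classifier described in the paper. The class name `NgramNaiveBayes`, its parameters, and the label strings are assumptions. In practice the training texts would come from the tweet text column and the labels from the definitely English annotation.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Lowercase the text, pad with spaces, and return its character n-grams."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams with add-alpha smoothing."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha
        self.counts = {}        # label -> Counter of n-gram frequencies
        self.totals = {}        # label -> total number of n-grams seen
        self.docs = Counter()   # label -> number of training texts
        self.vocab = set()      # all n-grams seen during training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        n_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counter in self.counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.docs[label] / n_docs)
            denom = self.totals[label] + self.alpha * vocab_size
            for gram in grams:
                score += math.log((counter[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Character n-grams are a common choice for short, noisy text because they do not depend on tokenization. A serious evaluation would hold out part of the 10,502 tweets for testing and decide separately how to treat the ambiguous and code-switched examples.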