UMass Global English on Twitter Dataset


Internet, Universities and Colleges, Email and Messaging, Linguistics, Languages Classification


README.md

Context: It can be difficult to identify the language that a tweet is written in. Tweets are very short, and they often include code-switching, where the user uses two or more languages together, or names borrowed from a different language. This dataset contains tweets from a variety of languages, tagged for whether they are in English or not, whether they contain code-switching, whether they include names from a different language, and whether they were generated automatically.

Content: This dataset contains 10,502 tweets, randomly sampled from all publicly available geotagged Twitter messages, annotated for being in English, non-English, or having code-switching, language ambiguity, or having been automatically generated. It includes messages sent from 130 different countries.

The file all_annotated.tsv contains the dataset of 10,502 tweets used in the paper. Text is encoded as UTF-8. The column headings (also given in the .tsv file) are:

* tweet ID
* ISO country code
* tweet date
* tweet text
* definitely English
* ambiguous
* definitely not English
* code-switched
* ambiguous due to named entities
* automatically generated tweets

All annotations are binary; the definitely English, ambiguous, and definitely not English annotations are mutually exclusive.

Acknowledgements: This dataset was collected by Su Lin Blodgett, Johnny Tian-Zheng Wei and Brendan O'Connor. It is redistributed here under the [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/). If you use this data in your work, please cite the following paper:

Blodgett, Su Lin, Johnny Wei, and Brendan O'Connor. "[A Dataset and Classifier for Recognizing Social Media English](http://www.aclweb.org/anthology/W17-4408)." Proceedings of the 3rd Workshop on Noisy User-generated Text. 2017.

You can find more information on this dataset and related work on [this website](http://slanglab.cs.umass.edu/TwitterLangID/).
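The column layout described under Content can be read with the Python standard library. The following is a minimal sketch, not an official loader. It assumes that the file has one heading row, that the ten columns appear in the order listed in the README, and that the binary annotations are stored as `0`/`1`. The short column names (`tweet_id`, `definitely_english`, and so on) and the helper names are hypothetical and are assigned by position, because the exact header wording in the file is not reproduced here.

```python
import csv

# Hypothetical short names, assigned by position in the order the README
# lists the columns; the header text in the real file may be worded differently.
COLUMNS = [
    "tweet_id", "country", "date", "text",
    "definitely_english", "ambiguous", "definitely_not_english",
    "code_switched", "ambiguous_named_entities", "automatically_generated",
]
FLAG_COLUMNS = COLUMNS[4:]   # the six binary annotations
EXCLUSIVE = COLUMNS[4:7]     # the three mutually exclusive labels


def load_annotations(path, has_header=True):
    """Read the TSV into a list of dicts, converting flags to int.

    Assumes flags are written as 0/1; adjust the conversion if the file
    encodes them differently.
    """
    rows = []
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE: tweet text may contain stray quote characters.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        if has_header:
            next(reader, None)  # skip the heading row
        for fields in reader:
            if len(fields) != len(COLUMNS):
                continue  # skip malformed lines rather than guess
            row = dict(zip(COLUMNS, fields))
            for name in FLAG_COLUMNS:
                row[name] = int(row[name])
            rows.append(row)
    return rows


def check_exclusive(rows):
    """Return rows that carry more than one of the mutually exclusive labels."""
    return [r for r in rows if sum(r[c] for c in EXCLUSIVE) > 1]
```

Calling `load_annotations("all_annotated.tsv")` and then `check_exclusive(...)` on the result is a quick sanity check that the three language labels never co-occur in one row.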
Inspiration:

* Can you use this dataset to build a classifier that identifies whether a tweet is in English or not?
* Can you use this dataset to build a language identifier? (You can check out the authors’ language identifier [here](http://slanglab.cs.umass.edu/TwitterLangID/).)
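As a starting point for the first question, here is one possible baseline: a multinomial naive Bayes classifier over character trigrams, written with the standard library only. It is a sketch for illustration and is not the classifier from the authors' paper. The class name `NaiveBayes`, the helper `char_ngrams`, the labels `"en"` and `"not_en"`, and the four training sentences are all made up for this example and are not drawn from the dataset.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Lowercased character n-grams, padded so short tweets still yield features."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayes:
    """Multinomial naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.totals = Counter()             # label -> total n-gram count
        self.docs = Counter()               # label -> number of training texts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.totals[label] += len(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        n_docs = sum(self.docs.values())
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen n-grams
        grams = char_ngrams(text, self.n)
        best_label, best_score = None, -math.inf
        for label in self.docs:
            score = math.log(self.docs[label] / n_docs)  # class prior
            for gram in grams:
                score += math.log(
                    (self.counts[label][gram] + 1) / (self.totals[label] + v)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example (not from the dataset).
train_texts = [
    "the weather is nice today",
    "i love this song so much",
    "el clima está muy bonito hoy",
    "me encanta esta canción",
]
train_labels = ["en", "en", "not_en", "not_en"]
model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("i love the weather"))
```

On the real data, the tweet text column would supply `texts` and the "definitely English" annotation would supply `labels`; how to treat the ambiguous and code-switched tweets is a design decision this sketch leaves open.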


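Data usage statement:

I. Data source and display:

1. This data comes from internet data collection or from service providers; this platform provides display and browsing of datasets for users.
2. This platform only displays basic information about datasets, including but not limited to image, text, video, and audio file types.
3. Basic dataset information comes from the original data address or from the data provider. If the dataset description differs, the original data address or the provider's original address prevails.

II. Ownership:

1. The copyright of every dataset on this site belongs to the original data publisher or the data provider.

III. Reposting:

1. If you need to repost data from this site, please retain the original data address and the related copyright notice.

IV. Infringement and handling:

1. If any data on this site involves infringing display, please contact the site promptly and we will arrange to take the data offline.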
