UMass Global English on Twitter Dataset

1.21M
Internet,Universities and Colleges,Email and Messaging,Linguistics,Languages Classification


    README.md

Context: It can be difficult to identify the language a tweet is written in. Tweets are very short, and they often include code-switching, where the user mixes two or more languages, or names borrowed from a different language. This dataset contains tweets in a variety of languages, tagged for whether they are in English, whether they contain code-switching, whether they include names from a different language, and whether they were generated automatically.

Content: This dataset contains 10,502 tweets, randomly sampled from all publicly available geotagged Twitter messages and annotated as English, non-English, code-switched, ambiguous in language, or automatically generated. It includes messages sent from 130 different countries.

The file all_annotated.tsv contains the dataset of 10,502 tweets used in the paper. Text is encoded as UTF-8. The column headings (also given in the .tsv file) are: tweet ID, ISO country code, tweet date, tweet text, definitely English, ambiguous, definitely not English, code-switched, ambiguous due to named entities, and automatically generated tweets. All annotations are binary; the definitely English, ambiguous, and definitely not English annotations are mutually exclusive.

Acknowledgements: This dataset was collected by Su Lin Blodgett, Johnny Tian-Zheng Wei and Brendan O'Connor. It is redistributed here under the [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/). If you use this data in your work, please cite the following paper: Blodgett, Su Lin, Johnny Wei, and Brendan O'Connor. "[A Dataset and Classifier for Recognizing Social Media English](http://www.aclweb.org/anthology/W17-4408)." Proceedings of the 3rd Workshop on Noisy User-generated Text. 2017. You can find more information on this dataset and related work on [this website](http://slanglab.cs.umass.edu/TwitterLangID/).
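The column layout of all_annotated.tsv described above can be read with a few lines of Python. The following is a minimal sketch, not the authors' own loader. It assumes the ten columns appear in the order the README lists them, that the first row is the header, that the binary annotations are stored as the strings `0` and `1`, and that no tweet text contains a raw tab or newline. The field names on `AnnotatedTweet` and the rows in `SAMPLE` are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from typing import Iterable, List, NamedTuple


class AnnotatedTweet(NamedTuple):
    # Field names are hypothetical; the real header strings live in the .tsv file.
    tweet_id: str
    country: str
    date: str
    text: str
    definitely_english: bool
    ambiguous: bool
    definitely_not_english: bool
    code_switched: bool
    ambiguous_named_entities: bool
    automatically_generated: bool


def read_annotated(lines: Iterable[str]) -> List[AnnotatedTweet]:
    """Parse tab-separated rows in the README's column order, skipping the header row."""
    # QUOTE_NONE keeps quotation marks inside tweets as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # assumed header row
    tweets = []
    for row in reader:
        if len(row) != 10:
            raise ValueError(f"expected 10 columns, got {len(row)}: {row!r}")
        # Assumption: annotations are encoded as "0" / "1".
        flags = [cell.strip() == "1" for cell in row[4:]]
        # The README states the first three annotations are mutually exclusive.
        if sum(flags[:3]) > 1:
            raise ValueError(f"mutually exclusive labels overlap for tweet {row[0]}")
        tweets.append(AnnotatedTweet(*row[:4], *flags))
    return tweets


# Synthetic placeholder rows, NOT real tweets from the dataset.
SAMPLE = (
    "id\tcountry\tdate\ttext\ten\tamb\tnot_en\tcs\tne\tauto\n"
    "1\tUS\t2013-01-01\thello world\t1\t0\t0\t0\t0\t0\n"
    "2\tBR\t2013-01-02\tbom dia\t0\t0\t1\t0\t0\t0\n"
)

tweets = read_annotated(io.StringIO(SAMPLE))
english = sum(t.definitely_english for t in tweets)
print(english, "of", len(tweets), "rows marked definitely English")
```

To read the real file, the same function can be given an open file handle, for example `open("all_annotated.tsv", encoding="utf-8", newline="")`, provided the assumptions above hold for the actual data.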
Inspiration:

* Can you use this dataset to build a classifier that identifies whether a tweet is in English or not?
* Can you use this dataset to build a language identifier? (You can check out the authors’ language identifier [here](http://slanglab.cs.umass.edu/TwitterLangID/).)
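As one possible starting point for the first question, here is a minimal sketch of a character-bigram Naive Bayes classifier with add-one smoothing. It is a generic baseline and not the classifier described in the authors' paper. It assumes the three mutually exclusive labels have already been reduced to a binary English / not-English target (how to treat the ambiguous class is left open), and the class name `NgramNaiveBayes` and the toy training sentences are invented for illustration rather than drawn from the dataset.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Lowercase the text, pad it with spaces, and return its character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (hypothetical baseline)."""

    def __init__(self, n: int = 2) -> None:
        self.n = n
        self.counts: Dict[bool, Counter] = {True: Counter(), False: Counter()}
        self.docs: Counter = Counter()
        self.vocab: set = set()

    def fit(self, examples: Iterable[Tuple[str, bool]]) -> "NgramNaiveBayes":
        for text, is_english in examples:
            grams = char_ngrams(text, self.n)
            self.counts[is_english].update(grams)
            self.docs[is_english] += 1
            self.vocab.update(grams)
        return self

    def _log_score(self, text: str, label: bool) -> float:
        total = sum(self.counts[label].values())
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen n-grams
        # Smoothed class prior.
        score = math.log((self.docs[label] + 1) / (sum(self.docs.values()) + 2))
        # Add-one smoothed likelihood of each n-gram under this class.
        for gram in char_ngrams(text, self.n):
            score += math.log((self.counts[label][gram] + 1) / (total + vocab_size))
        return score

    def predict(self, text: str) -> bool:
        """Return True when the English class scores strictly higher."""
        return self._log_score(text, True) > self._log_score(text, False)


# Toy sentences for illustration only, NOT rows from the dataset.
train = [
    ("the weather is nice today", True),
    ("see you at the game tonight", True),
    ("que bueno verte hoy amigo", False),
    ("hasta la vista mi amigo", False),
]
model = NgramNaiveBayes().fit(train)
print(model.predict("the game is on tonight"))
print(model.predict("mi amigo que bueno"))
```

On real data, the `(text, label)` pairs would come from the tweet text and annotation columns, and a held-out split would be needed to measure accuracy; code-switched and automatically generated tweets may need separate handling.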