IMDB 50K Movie Reviews (Test Your BERT)

Size: 62.91M
Arts and Entertainment,Internet,Movies and TV Shows,NLP,Text Data,Art Classification

    README.md

## Context

**`Large Movie Review Dataset v1.0`**

![IMDB wall](https://static.amazon.jobs/teams/53/images/IMDb_Header_Page.jpg?1501027252)

This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides a set of 25,000 highly polar movie reviews for training and 25,000 for testing. There is additional unlabeled data for use as well. Both raw text and an already processed bag-of-words format are provided.

In the entire collection, no more than 30 reviews are allowed for any given movie, because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain disjoint sets of movies, so no significant performance is obtained by memorising movie-unique terms and their association with observed labels.

In the labelled train/test sets, a `negative` review has a **score <= 4 out of 10**, and a `positive` review has a **score >= 7 out of 10**. Thus reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included, and there are an even number of reviews **> 5 and <= 5**.

**`Reference:`** http://ai.stanford.edu/~amaas/data/sentiment/

***NOTE***

**`A starter kernel is here :`** https://www.kaggle.com/atulanandjha/bert-testing-on-imdb-dataset-starter-kernel

**`A kernel to expose Dataset collection :`**

## Content

Now let's understand the task at hand: given a movie review, predict whether it is `positive` or `negative`.

The dataset we use is **50,000 IMDB** reviews (**25K for train and 25K for test**) from the **PyTorch-NLP** library. Each review is tagged **pos** or **neg**. There are **50% positive** reviews and **50% negative** reviews in both the train and test sets.

Columns:

- `text`: Reviews from people.
- `Sentiment`: Negative or Positive tag on the review/feedback (Boolean).
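The labelling rule described above (a score of 4 or lower is `negative`, a score of 7 or higher is `positive`, and anything in between is left out) can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build the dataset: the function names `label_from_score` and `class_balance` are hypothetical, and the sample ratings are made up for the example rather than taken from the data.

```python
from collections import Counter
from typing import Optional


def label_from_score(score: int) -> Optional[str]:
    """Map a 1-10 rating to the sentiment label described in the README.

    Returns None for neutral ratings (5 or 6), which the labelled
    train/test sets leave out.
    """
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    return None


def class_balance(labels):
    """Return the fraction of each label in an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical ratings, invented for illustration (not taken from the dataset).
sample_scores = [1, 4, 5, 6, 7, 10]
sample_labels = [label_from_score(s) for s in sample_scores]

# Drop the neutral reviews, as the labelled sets do, then check the balance.
kept = [label for label in sample_labels if label is not None]
balance = class_balance(kept)
print(balance)
```

The same `class_balance` check can be run on the `Sentiment` column after loading the data, as a quick way to confirm the stated 50/50 split before training a model.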
## Acknowledgements

**When using this dataset, please `cite` this ACL paper using:**

> @InProceedings{
> maas-EtAl:2011:ACL-HLT2011,
> author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
> title = {Learning Word Vectors for Sentiment Analysis},
> booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
> month = {June},
> year = {2011},
> address = {Portland, Oregon, USA},
> publisher = {Association for Computational Linguistics},
> pages = {142--150},
> url = {http://www.aclweb.org/anthology/P11-1015}
> }

**Link to ref Dataset:**

https://pytorchnlp.readthedocs.io/en/latest/_modules/torchnlp/datasets/imdb.html

https://www.samyzaf.com/ML/imdb/imdb.html

## Inspiration

BERT and other Transformer-architecture models have attracted a lot of attention recently, thanks to the breakthrough of bringing transfer learning to NLP. So let's use this simple yet effective dataset to test these models, and also compare our results with theirs. I also invite fellow researchers to try out their state-of-the-art algorithms on this dataset.