README.md
Context
**`Large Movie Review Dataset v1.0`**
![IMDB wall](https://static.amazon.jobs/teams/53/images/IMDb_Header_Page.jpg?1501027252)
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing, along with additional unlabeled data. Both raw text and an already-processed bag-of-words format are included.
In the entire collection, no more than 30 reviews are allowed for any given movie, because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain a disjoint set of movies, so no significant performance gain can be obtained by memorising movie-unique terms and their association with observed labels. In the labelled train/test sets, a `negative` review has a **score <= 4 out of 10**, and a `positive` review has a **score >= 7 out of 10**; reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included, with an even number of reviews **> 5 and <= 5**.
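As a quick illustration, here is a minimal sketch of the labelling rule described above; the function name is hypothetical and not part of the dataset's own tooling.

```python
from typing import Optional

def label_from_score(score: int) -> Optional[str]:
    """Map a 1-10 IMDB rating to the dataset's label, per the rule above."""
    if score <= 4:
        return "neg"   # negative: score <= 4/10
    if score >= 7:
        return "pos"   # positive: score >= 7/10
    return None        # neutral ratings are excluded from the labelled sets

assert label_from_score(2) == "neg"
assert label_from_score(9) == "pos"
assert label_from_score(5) is None
```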
**`Reference:`** http://ai.stanford.edu/~amaas/data/sentiment/
***NOTE***
**`A starter kernel is here :`** https://www.kaggle.com/atulanandjha/bert-testing-on-imdb-dataset-starter-kernel
**`A kernel showcasing the dataset collection:`**
Content
Now let's understand the task at hand: given a movie review, predict whether it's `positive` or `negative`.
The dataset we use is **50,000 IMDB** reviews (**25K for train and 25K for test**) from the **PyTorch-NLP** library.
Each review is tagged **pos** or **neg**.
There are **50% positive** reviews and **50% negative** reviews both in train and test sets.
Columns:
`text:` the review text written by users.
`sentiment:` Positive or Negative tag on the review/feedback (binary).
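For reference, a minimal sketch of loading these reviews with the PyTorch-NLP (`torchnlp`) package, following the `imdb_dataset` API documented at the link in the Acknowledgements section below; it assumes the package is installed.

```python
# Minimal sketch, assuming the `pytorch-nlp` package (torchnlp) is installed.
# imdb_dataset downloads the Large Movie Review Dataset and returns the
# 25K-train / 25K-test splits as {'text': ..., 'sentiment': 'pos'/'neg'} examples.
from torchnlp.datasets import imdb_dataset

train, test = imdb_dataset(train=True, test=True)

print(len(train), len(test))                       # 25000 25000
example = train[0]
print(example['sentiment'], example['text'][:80])  # label and start of the review
```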
Acknowledgements
**When using this dataset, please cite the following ACL paper:**
> @InProceedings{maas-EtAl:2011:ACL-HLT2011,
> author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
> title = {Learning Word Vectors for Sentiment Analysis},
> booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
> month = {June},
> year = {2011},
> address = {Portland, Oregon, USA},
> publisher = {Association for Computational Linguistics},
> pages = {142--150},
> url = {http://www.aclweb.org/anthology/P11-1015}
> }
**Link to ref Dataset:** https://pytorchnlp.readthedocs.io/en/latest/_modules/torchnlp/datasets/imdb.html
https://www.samyzaf.com/ML/imdb/imdb.html
Inspiration
BERT and other Transformer-architecture models have received a great deal of attention recently, thanks to the breakthrough of transfer learning in NLP. Let's use this simple yet effective dataset to test these models and compare our results with the published ones. I also invite fellow researchers to try their state-of-the-art algorithms on this dataset.
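As a starting point (separate from the starter kernel linked above), here is a minimal sketch of scoring reviews with a pretrained Transformer via the Hugging Face `transformers` pipeline; the default pipeline model is an illustrative assumption and is not part of this dataset.

```python
# Minimal sketch: score a few reviews with a pretrained sentiment model.
# Requires the `transformers` package; the default pipeline model is an
# illustrative assumption and is not distributed with this dataset.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default pretrained model

reviews = [
    "This movie was a masterpiece, I loved every minute of it.",
    "A dull, predictable plot with wooden acting.",
]
for review, result in zip(reviews, classifier(reviews)):
    print(f"{result['label']:>8}  ({result['score']:.3f})  {review[:50]}")
```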