README.md
Goal
This is a small project led by Yury Kashnitsky within the OpenDataScience and Amsterdam Data Science communities. We plan to explore transfer and semi-supervised learning techniques for NLP tasks, mainly classification. The idea is to develop best practices for production-grade use of models such as BERT and ULMFiT (and possibly others). Possible outcomes of this collaboration:
- primarily, shared experience within this group and progress in our own projects
- articles sharing our experience (e.g. on Medium)
- shared models, e.g. a trained LM for ULMFiT in Dutch
- a small library, e.g. to productionize ULMFiT models (if they turn out to work best)
Anybody is welcome to join and **share findings** via Kernels and Discussions.
Datasets
We are gathering several datasets in English, Russian and Dutch. Each of them addresses the same **general task**: using a large amount of unlabeled text to improve classification of (scarce) labeled text. For each task we have the following files:
- train.csv (small)
- validation.csv (small)
- unlabeled.csv (large)
- test.csv (optionally, within competitions)
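
As an illustration of the file layout above, here is a minimal sketch of loading one task's splits using only the Python standard library. The helper names `read_rows` and `load_task` are hypothetical and not part of this project. Column names are not assumed, since the README does not list them; they are read from each file's header.

```python
import csv
from pathlib import Path

# File names follow the per-task layout listed above; test.csv is optional.
REQUIRED_FILES = ("train.csv", "validation.csv", "unlabeled.csv")
OPTIONAL_FILES = ("test.csv",)


def read_rows(path):
    """Read one CSV file into a list of dicts keyed by its header row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def load_task(data_dir):
    """Load every split found in data_dir (hypothetical helper).

    Returns a dict such as {"train": [...], "validation": [...], ...}.
    Column names come from each file's header and are not assumed here.
    """
    data_dir = Path(data_dir)
    splits = {}
    for name in REQUIRED_FILES:
        path = data_dir / name
        if not path.exists():
            raise FileNotFoundError(f"missing required file: {path}")
        splits[path.stem] = read_rows(path)
    for name in OPTIONAL_FILES:
        path = data_dir / name
        if path.exists():
            splits[path.stem] = read_rows(path)
    return splits
```

Calling `load_task("path/to/task")` on a directory that holds the files above would return the splits keyed by file stem, so `len(splits["unlabeled"])` gives the number of unlabeled rows. For the larger unlabeled files, a streaming reader may be a better fit than loading everything into memory.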
Current datasets are:
- Amazon pet product reviews classification (English, 6 classes, 52k train, 17k valid, 17k test, 100k unlabeled), [competition](https://www.kaggle.com/c/amazon-pet-product-reviews-classification/), see [Kernels](https://www.kaggle.com/c/amazon-pet-product-reviews-classification/kernels) for baselines: logit-tfidf, ULMFiT & BERT
- Amazon healthcare reviews (English, 6 classes, 7k train, 3k valid, 200k unlabeled)
- Clickbait news detection (English, 3 classes, 25k train, 5.5k valid, 3.5k test, 80k unlabeled), [competition](https://www.kaggle.com/c/clickbait-news-detection), see [Kernels](https://www.kaggle.com/c/clickbait-news-detection/kernels) for baselines: logit-tfidf, ULMFiT & BERT
- Dutch book reviews (Dutch, 2 classes, 14k train, 6k valid, 90k unlabeled)
Acknowledgements
Thanks to Vladislav Lyalin for the clickbait news data (original [competition](https://www.kaggle.com/c/dlinnlp-spring-2019-clf) by ipavlov) and to [Benjamin van der Burgh](https://github.com/benjaminvdb) for Dutch reviews data (source [repository](https://github.com/benjaminvdb/110kDBRD)).
Background image credit: Jeremy Howard, [fast.ai Lesson 4](https://course.fast.ai/videos/?lesson=4)