Exploring Transfer Learning for NLP

Size: 1743.96M

Business,Education,Social Science,NLP,Classification,Research,Transfer Learning Classification

README.md

## Goal

This is a small project led by Yury Kashnitsky within the OpenDataScience and Amsterdam Data Science communities. We plan to explore transfer and semi-supervised learning techniques for NLP tasks, mainly classification. The idea is to develop best practices for using models such as BERT and ULMFiT (and possibly others) in production.

Possible outcomes of this collaboration:

- primarily, shared experience within this group and progress in our own projects
- articles sharing our experience (e.g. on Medium)
- shared models, e.g. a language model trained for ULMFiT in Dutch
- a small library, e.g. to productionize ULMFiT models (if they turn out to work best)

Anybody is welcome to join and **share findings** via Kernels and Discussions.

## Datasets

We are gathering several datasets in English, Russian and Dutch. Each of them addresses the same **general task**: using large amounts of unlabeled text to improve classification of (scarce) labeled text. For each task we have the following files:

- train.csv (small)
- validation.csv (small)
- unlabeled.csv (large)
- test.csv (optional, only for competitions)

Current datasets are:

- Amazon pet product reviews classification (English, 6 classes, 52k train, 17k valid, 17k test, 100k unlabeled), [competition](https://www.kaggle.com/c/amazon-pet-product-reviews-classification/), see [Kernels](https://www.kaggle.com/c/amazon-pet-product-reviews-classification/kernels) for baselines: logit-tfidf, ULMFiT & BERT.
- Amazon healthcare reviews (English, 6 classes, 7k train, 3k valid, 200k unlabeled).
- Clickbait news detection (English, 3 classes, 25k train, 5.5k valid, 3.5k test, 80k unlabeled), [competition](https://www.kaggle.com/c/clickbait-news-detection), see [Kernels](https://www.kaggle.com/c/clickbait-news-detection/kernels) for baselines: logit-tfidf, ULMFiT & BERT.
- Dutch book reviews (Dutch, 2 classes, 14k train, 6k valid, 90k unlabeled).
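
The per-task file layout described above can be read with a few lines of standard-library Python. This is only a minimal sketch: the function name `load_task` and the idea that each task lives in its own directory are assumptions made for illustration, as is the expectation that every CSV is UTF-8 encoded and starts with a header row. Column names are taken from the files themselves, so none are assumed here.

```python
import csv
from pathlib import Path

# File names come from the README; test.csv is optional.
REQUIRED_SPLITS = ("train", "validation", "unlabeled")
OPTIONAL_SPLITS = ("test",)


def load_task(data_dir):
    """Read every available split of one task into a list of row dicts.

    `data_dir` is a hypothetical directory holding the CSV files of one task.
    """
    data_dir = Path(data_dir)
    splits = {}
    for name in REQUIRED_SPLITS + OPTIONAL_SPLITS:
        path = data_dir / f"{name}.csv"
        if not path.exists():
            if name in REQUIRED_SPLITS:
                raise FileNotFoundError(f"missing required split: {path}")
            continue  # test.csv only exists for competition tasks
        with path.open(newline="", encoding="utf-8") as handle:
            # DictReader takes column names from the header row (assumed
            # to be present), so no column names are hard-coded here.
            splits[name] = list(csv.DictReader(handle))
    return splits
```

Calling `load_task("path/to/one/task")` (a placeholder path) returns a dictionary keyed by split name, which keeps the small labeled splits and the large unlabeled split side by side for whichever model is tried next. For the larger unlabeled files, streaming rows instead of materializing a list may be preferable.
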

## Acknowledgements

Thanks to Vladislav Lyalin for the clickbait news data (original [competition](https://www.kaggle.com/c/dlinnlp-spring-2019-clf) by ipavlov) and to [Benjamin van der Burgh](https://github.com/benjaminvdb) for the Dutch reviews data (source [repository](https://github.com/benjaminvdb/110kDBRD)).

Background image credit: Jeremy Howard, [fast.ai Lesson 4](https://course.fast.ai/videos/?lesson=4)
