README.md
Context
New information about the environment becomes publicly available every second: reports, books, articles, news, and more, published in different languages. Automatic classification would allow this information to be processed and used more efficiently for decision-making.
Content
This version of the dataset contains two files so far: an English-language dataset taken from the English edition of a book that I co-authored, and a Ukrainian-language dataset taken from the separate Ukrainian edition of the same book. The two datasets share approximately 95% of their information. Each dataset contains:
* Text - One or more sentences from reports or news
and 5 binary target features:
* Env_problems - Is the text about an environmental problem? (0 or 1)
* Pollution - Is the text about environmental pollution? (0 or 1)
* Treatment - Is the text about treatment plants or environmental technologies? (0 or 1)
* Climate - Is the text about climatic indicators? (0 or 1)
* Biomonitoring - Is the text about biological, biotic monitoring in water or in a river basin? (0 or 1)
**NLP, Multilabel binary classification**
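The column layout described above can be checked with a short loader. This is only a sketch: the README does not state the file format, so CSV is an assumption, and the two sample rows are invented for illustration and are not taken from the dataset. The column names are the ones listed above.

```python
import csv
import io

TEXT_COLUMN = "Text"
TARGET_COLUMNS = ["Env_problems", "Pollution", "Treatment", "Climate", "Biomonitoring"]

# Invented rows for illustration only; they are NOT records from the dataset.
# The CSV layout itself is an assumption about how the files are stored.
SAMPLE_CSV = """Text,Env_problems,Pollution,Treatment,Climate,Biomonitoring
Untreated wastewater is discharged into the river,1,1,0,0,0
Average annual precipitation in the basin,0,0,0,1,0
"""

def load_records(handle):
    """Parse CSV rows into (text, labels) pairs, checking that every label is 0 or 1."""
    records = []
    for row_number, row in enumerate(csv.DictReader(handle), start=1):
        labels = {}
        for name in TARGET_COLUMNS:
            value = row[name].strip()
            if value not in ("0", "1"):
                raise ValueError(f"row {row_number}: {name} must be 0 or 1, got {value!r}")
            labels[name] = int(value)
        records.append((row[TEXT_COLUMN], labels))
    return records

records = load_records(io.StringIO(SAMPLE_CSV))
for text, labels in records:
    # A text may carry several labels at once, which is what makes the task multilabel.
    active = [name for name, flag in labels.items() if flag]
    print(text, "->", active)
```

To read a real file, the same function could be given an open file handle instead of the in-memory sample.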
The source text was published in print as two separate books, one in English and one in Ukrainian. A co-author of both editions is also a co-author of this dataset.
1. Sentences were copied from the PDF of the book (English from the English edition, Ukrainian from the Ukrainian edition) together with various special characters (page numbers, etc.), so the text contains a certain amount of "noise".
2. Some special characters were selectively removed, including some commas and periods.
3. The target features were assigned by the authors of the dataset, including the co-author of the book.
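Because the text keeps some PDF residue such as page numbers, a light cleaning pass before modeling may help. The sketch below is one possible approach, not the authors' procedure: the function name `clean_text` is hypothetical, and the assumption that page numbers appear as digit-only lines has not been verified against the files. The example string is invented.

```python
import re

def clean_text(text):
    """Drop lines that contain only digits (likely page numbers) and tidy whitespace."""
    kept = [line for line in text.splitlines() if not re.fullmatch(r"\s*\d+\s*", line)]
    joined = " ".join(kept)
    # Remove control characters that PDF extraction sometimes leaves behind.
    joined = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", joined)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", joined).strip()

# Invented example string, not a record from the dataset.
raw = "The river basin covers\n  several regions\n47\nof the country"
print(clean_text(raw))  # The river basin covers several regions of the country
```

Numbers that appear inside a sentence are left alone, since only lines made up entirely of digits are dropped.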
Over time, I plan to add more datasets with news and other reports (English, Ukrainian, and both).
Acknowledgements
* The project "Institutional Strengthening and Capacity Building for the Ukrainian River Basin Management Authority" (2010-2014), financed by the Swedish International Development Agency
* [State Agency of Water Resources of Ukraine](https://www.davr.gov.ua/)
* All the co-authors of the book "River Basin Management Plan for Pivdenny Bug: river basin analysis and measures" (Ukrainian) / Afanasiev S., Bedz N., Kryzhanivsky E., Mokin V., etc. / Afanasiev S., Peters A., Stashuk V., Iarochevitch O. // Ed.: S. Afanasiev, A. Peters, V. Stashuk, O. Iarochevitch. – Edition: Interservice publishing house, Kiev, 2014. – 188 pages. – ISBN: 978-617-696-258-8. – DOI: 10.13140/2.1.1707.2325, including both of its editions: [English edition of the book](https://www.researchgate.net/publication/275210400_River_Basin_Management_Plan_for_Pivdenny_Bug_river_basin_analysis_and_measures) and [Ukrainian edition of the book](https://mk-vodres.davr.gov.ua/sites/default/files/Bug_plan_final_2.pdf)
* Students majoring in 126 Information Systems and Technologies at Vinnytsia National Technical University, who helped me create this dataset from both editions of the book: [Dmytro Pasichniuk](https://www.kaggle.com/kenywhite), [Oleksandr Radetskyi](https://www.kaggle.com/sasharadeckiy)
* [Vasiliy Kostiushyn](https://www.facebook.com/v.kostiushyn), head of department at the I.I. Schmalhausen Institute of Zoology of the National Academy of Sciences of Ukraine, for the photo used for the dataset.
Inspiration
It is proposed to solve the following task:
Classify the text as accurately as possible for each of the 5 binary target features (for the English-language dataset, the Ukrainian-language dataset, or both). Accuracy is to be measured on a held-out test set, selected in advance, of at least 40% of the data, randomly drawn from all data in the relevant language or languages. Do not use or incorporate information from hand labeling or human prediction of the validation or test records. You may combine external data with this dataset to develop and test your models, but only public external data, with the source indicated.
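The evaluation protocol in the task can be sketched as follows: hold out at least 40% of the rows at random before any modeling, then report accuracy separately for each of the 5 targets. The helper names (`split_indices`, `per_label_accuracy`, `majority_baseline`) are hypothetical, the labels are synthetic random values rather than dataset records, and the majority-class baseline only stands in for a real text classifier.

```python
import math
import random

TARGETS = ["Env_problems", "Pollution", "Treatment", "Climate", "Biomonitoring"]

def split_indices(n_rows, test_fraction=0.4, seed=0):
    """Shuffle row indices and reserve at least test_fraction of them for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_fraction)  # round up so the share never drops below 40%
    return indices[n_test:], indices[:n_test]   # (train, test)

def per_label_accuracy(y_true, y_pred):
    """Return the accuracy for each target, given lists of {target: 0/1} dicts."""
    return {
        name: sum(t[name] == p[name] for t, p in zip(y_true, y_pred)) / len(y_true)
        for name in TARGETS
    }

def majority_baseline(train_labels):
    """Predict, for each target, the value that is most common in the training rows."""
    return {
        name: int(2 * sum(row[name] for row in train_labels) >= len(train_labels))
        for name in TARGETS
    }

# Synthetic labels for illustration only; they are NOT taken from the dataset.
rng = random.Random(1)
labels = [{name: rng.randint(0, 1) for name in TARGETS} for _ in range(50)]

train_idx, test_idx = split_indices(len(labels))
baseline = majority_baseline([labels[i] for i in train_idx])  # fitted on training rows only
y_true = [labels[i] for i in test_idx]
y_pred = [baseline for _ in test_idx]
print(per_label_accuracy(y_true, y_pred))
```

Fixing the seed keeps the held-out rows the same across experiments, so the test labels are never consulted while a model is being developed.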