Social Science,Investing,NLP,Literature,Environment,Binary Classification,Multilabel Classification,Water Bodies Classification

Data structure: 0.03M


Context

New information about the environment enters the public domain every second: reports, books, articles and news are published in many languages. Automatic classification would allow this information to be processed and used more efficiently for decision-making.

Content

This version of the dataset contains 2 files so far: an English-language dataset taken from the English-language edition of a book that I co-authored, and a Ukrainian-language dataset taken from the separate Ukrainian-language edition of the same book. The two datasets share approximately 95% of their information. Each record contains:

* Text - One or more sentences from reports or news

and 5 binary target features:

* Env_problems - Is the text about an environmental problem? (0 or 1)
* Pollution - Is the text about environmental pollution? (0 or 1)
* Treatment - Is the text about treatment plants or environmental technologies? (0 or 1)
* Climate - Is the text about climatic indicators? (0 or 1)
* Biomonitoring - Is the text about biological (biotic) monitoring in water or in a river basin? (0 or 1)

**NLP, Multilabel binary classification**

The source text was published in print in English and in Ukrainian as two separate books. A co-author of both editions of the book is also a co-author of this dataset.

1. Sentences were copied from the book (English from the English edition, Ukrainian from the Ukrainian edition) in PDF format together with various special characters (page numbers, etc.), so the text contains a certain amount of "noise".
2. Some special characters have been selectively removed, including some commas and periods.
3. The target features were assigned by the authors of the dataset, including the co-author of the book.

Over time, I plan to add more datasets with news and other reports (English, Ukrainian, and both).
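The record layout described above (one `Text` column plus five 0/1 targets) can be sketched in code. This is only an illustration under stated assumptions: the description does not give the file names or confirm a CSV layout, so the sample below is a made-up two-row CSV, and `normalize` and `read_rows` are hypothetical helper names. The cleanup is deliberately mild (it only collapses whitespace), because the description says the remaining "noise" such as page numbers is part of the data.

```python
import csv
import io
import re

# Target column names as listed in the dataset description.
LABELS = ["Env_problems", "Pollution", "Treatment", "Climate", "Biomonitoring"]

# Hypothetical sample rows: the real files and their exact layout are assumptions.
SAMPLE_CSV = """Text,Env_problems,Pollution,Treatment,Climate,Biomonitoring
"Wastewater   discharge raised nitrate levels 47",1,1,0,0,0
"Mean annual air temperature in the basin",0,0,0,1,0
"""


def normalize(text):
    """Collapse runs of whitespace; leaves other characters untouched."""
    return re.sub(r"\s+", " ", text).strip()


def read_rows(handle):
    """Return a list of (text, {label: 0 or 1}) pairs from a CSV file object."""
    rows = []
    for record in csv.DictReader(handle):
        labels = {name: int(record[name]) for name in LABELS}
        if any(value not in (0, 1) for value in labels.values()):
            raise ValueError("target features must be 0 or 1")
        rows.append((normalize(record["Text"]), labels))
    return rows


rows = read_rows(io.StringIO(SAMPLE_CSV))
for text, labels in rows:
    active = [name for name in LABELS if labels[name] == 1]
    print(text, "->", active)
```

Because a record can carry several labels at once (the first sample row is both `Env_problems` and `Pollution`), each target is kept as its own 0/1 field rather than being merged into a single class.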
Acknowledgements

* The project "Institutional Strengthening and Capacity Building for the Ukrainian River Basin Management Authority" (2010-2014), financed by the Swedish International Development Agency
* State Agency of Water Resources of Ukraine
* All the co-authors of the book "River Basin Management Plan for Pivdenny Bug: river basin analysis and measures" (Ukrainian) / Afanasiev S., Bedz N., Kryzhanivsky E., Mokin V., etc. / Afanasiev S., Peters A., Stashuk V., Iarochevitch O. // Ed.: S. Afanasiev, A. Peters, V. Stashuk, O. Iarochevitch. – Edition: Interservice publishing house, Kiev, 2014. – 188 pages. – ISBN: 978-617-696-258-8. – DOI: 10.13140/2.1.1707.2325, including both of its editions: the English edition of the book and the Ukrainian edition of the book
* The students majoring in 126 Information Systems and Technologies at Vinnytsia National Technical University who helped me create this dataset from both editions of the book: Dmytro Pasichniuk, Oleksandr Radetskyi
* Vasiliy Kostiushyn, head of department at the I.I. Schmalhausen Institute of Zoology of the National Academy of Sciences of Ukraine, for the photo used for the dataset

Inspiration

The proposed task: classify the text as accurately as possible for each of the 5 binary target features (for the English-language dataset, the Ukrainian-language dataset, or both). Accuracy is to be measured on a pre-selected test set containing at least 40% of the data, selected at random from the total data in the relevant language or languages. Do not use or incorporate information from hand labeling or human prediction of the validation or test records. You may combine external data with this dataset to develop and test your models, but only public external data, with the source indicated.
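The evaluation protocol in the task (a random test share of at least 40%, with accuracy reported separately for each of the 5 targets) could be set up roughly as follows. This is a minimal sketch and not the author's reference procedure: `holdout_split`, `majority_baseline` and `per_label_accuracy` are hypothetical names, the rows are synthetic stand-ins for the real records, and the constant majority-class predictor is included only as a floor that any real model should beat.

```python
import math
import random

# Target column names as listed in the dataset description.
LABELS = ["Env_problems", "Pollution", "Treatment", "Climate", "Biomonitoring"]


def holdout_split(rows, test_fraction=0.4, seed=0):
    """Shuffle with a fixed seed and hold out at least `test_fraction` of rows."""
    if not 0.4 <= test_fraction < 1.0:
        raise ValueError("the task asks for a test share of at least 40%")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Rounding up keeps the test share at or above the requested fraction.
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def majority_baseline(train):
    """For each label, predict 1 only if more than half the training rows are 1."""
    return {
        name: int(2 * sum(labels[name] for _, labels in train) > len(train))
        for name in LABELS
    }


def per_label_accuracy(y_true, y_pred):
    """Fraction of records predicted correctly, computed separately per label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("need two non-empty sequences of equal length")
    return {
        name: sum(t[name] == p[name] for t, p in zip(y_true, y_pred)) / len(y_true)
        for name in LABELS
    }


# Synthetic stand-in rows: (text, {label: 0 or 1}); not taken from the dataset.
rows = [
    (f"sentence {i}", {name: int((i + k) % 3 == 0) for k, name in enumerate(LABELS)})
    for i in range(10)
]

train, test = holdout_split(rows)
constant = majority_baseline(train)  # fitted on the training part only
scores = per_label_accuracy([labels for _, labels in test], [constant] * len(test))
for name in LABELS:
    print(f"{name}: {scores[name]:.2f}")
```

Fixing the seed makes the held-out part "pre-selected" in the sense that it is chosen once, before any modelling, and the baseline sees only the training part, in line with the rule against using information from the test records.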


