TTC-3600: Benchmark Dataset for Turkish Text Categorization


    Assist. Prof. Dr. Deniz KILINÇ, Faculty of Technology, Celal Bayar University, Turkey
    drdenizkilinc'@'gmail.com


    Data Set Information:

    TTC-3600 consists of 3,600 documents: 600 news texts from each of six categories (economy, culture-arts, health, politics, sports and technology), obtained
    from six well-known news portals and agencies (Hurriyet, Posta, Iha, HaberTurk, Radikal and Zaman). The documents were collected between May and July 2015 via Rich Site Summary (RSS) feeds from the six categories of the respective portals. All JavaScript, HTML tags (<img>, <a>, <p>, <strong>, etc.), operators, punctuation, non-printable characters and irrelevant data such as advertising were removed.
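
The removal-based cleaning described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' actual pipeline; the function name `clean_text` and the exact removal rules are assumptions.

```python
import html
import re


def clean_text(raw: str) -> str:
    """Strip scripts, HTML tags, punctuation and non-printable characters."""
    # Remove <script> blocks together with their contents.
    text = re.sub(r"(?is)<script\b.*?</script\s*>", " ", raw)
    # Remove any remaining HTML tags such as <img>, <a>, <p>, <strong>.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Decode HTML entities (for example &amp;) left in the text.
    text = html.unescape(text)
    # Replace everything that is neither a word character nor whitespace:
    # this covers punctuation, operators and control characters.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


sample = "<p>Merhaba, <strong>dünya</strong>!</p><script>var x = 1;</script>"
cleaned = clean_text(sample)
print(cleaned)
```

Removing advertising and other irrelevant page content depends on each portal's markup and is not covered by this sketch.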

    Three additional versions of TTC-3600 are created by applying different stemming methods. In all versions, removal-based pre-processing (explained in detail in Section 3.2 of the paper) is applied first. Then Turkish stop-words that have no discriminatory power for text categorization (TC), such as pronouns, prepositions and conjunctions, are removed
    from every version except the original one. A semi-automatically constructed stop-word list of 147 words is used in this study.
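
Stop-word filtering of this kind might look like the sketch below. The 147-word list itself is not reproduced here, so `ILLUSTRATIVE_STOP_WORDS` is a small hypothetical stand-in, and the helper names are assumptions rather than part of the dataset tooling.

```python
# Hypothetical stand-in for the 147-word list, which is not given in this README.
ILLUSTRATIVE_STOP_WORDS = frozenset({"ve", "ile", "bu", "için", "ama", "de", "da"})


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish casing rules (I -> ı, İ -> i)."""
    # str.lower() alone would map "I" to "i" and "İ" to "i" plus a combining dot.
    return text.replace("I", "ı").replace("İ", "i").lower()


def remove_stop_words(text: str, stop_words=ILLUSTRATIVE_STOP_WORDS) -> list:
    """Split on whitespace and drop tokens found in the stop-word set."""
    tokens = turkish_lower(text).split()
    return [token for token in tokens if token not in stop_words]


tokens = remove_stop_words("Ekonomi ve spor için bu haber")
print(tokens)
```

The sketch assumes the text has already been stripped of punctuation, so whitespace splitting is enough to tokenize it.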


    Attribute Information:

    ARFF (Attribute-Relation File Format) Weka format
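
For readers unfamiliar with ARFF, the sketch below parses a tiny in-memory sample. The sample's relation name, attribute names and rows are hypothetical and do not come from the TTC-3600 files. The reader only handles a simple dense subset of the format: unquoted attribute names, single-quoted string values, no sparse rows.

```python
import csv
import io

# Hypothetical miniature file; names and rows are illustrative only.
SAMPLE_ARFF = """\
% Comment lines start with a percent sign.
@relation ttc_sample

@attribute text string
@attribute class {ekonomi,spor}

@data
'faiz kararı açıklandı',ekonomi
'takım maçı kazandı',spor
"""


def read_simple_arff(stream):
    """Parse a simple dense ARFF stream into (attribute_names, rows)."""
    attributes = []
    data_lines = []
    in_data = False
    for line in stream:
        stripped = line.strip()
        if not stripped or stripped.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            data_lines.append(stripped)
        elif stripped.lower().startswith("@attribute"):
            # The second whitespace-separated field is the attribute name.
            attributes.append(stripped.split(None, 2)[1])
        elif stripped.lower().startswith("@data"):
            in_data = True
    # Data rows are comma-separated; string values are single-quoted.
    reader = csv.reader(data_lines, quotechar="'", skipinitialspace=True)
    rows = [dict(zip(attributes, row)) for row in reader]
    return attributes, rows


attributes, rows = read_simple_arff(io.StringIO(SAMPLE_ARFF))
print(attributes, len(rows))
```

For the real files, opening them in Weka or with a full ARFF parser is the safer route, since this reader ignores backslash escapes and quoted attribute names.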


    Relevant Papers:

    [Web link]



    Citation Request:

    Kılınç, Deniz, et al. 'TTC-3600: A new benchmark dataset for Turkish text categorization.' Journal of Information Science (2015): 0165551515620551.
