News Aggregator Dataset

Computer Classification


    Data Set Information:

    News are grouped into clusters that represent pages discussing the same news story.
    The dataset also includes references to web pages that, at access time, linked to one of the news pages in the collection.

    The collection contains 422937 news pages, divided up into:

    152746 news of entertainment category
    108465 news of science and technology category
    115920 news of business category
    45615 news of health category

    2076 clusters of similar news for entertainment category
    1789 clusters of similar news for science and technology category
    2019 clusters of similar news for business category
    1347 clusters of similar news for health category
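
The per-category totals above (news pages and clusters of similar news) could be recomputed from the data roughly as follows. This is a minimal sketch, assuming each news page has already been reduced to a (category, cluster ID) pair; the `summarize` helper and the sample pairs are hypothetical and not part of the dataset.

```python
from collections import defaultdict


def summarize(pairs):
    """Return {category: (number of news pages, number of distinct clusters)}."""
    news_count = defaultdict(int)
    clusters = defaultdict(set)
    for category, story in pairs:
        news_count[category] += 1      # every pair is one news page
        clusters[category].add(story)  # a cluster is counted once per category
    return {c: (news_count[c], len(clusters[c])) for c in news_count}


# Made-up sample: three business pages in two clusters, one entertainment page.
pairs = [
    ("business", "s1"),
    ("business", "s1"),
    ("business", "s2"),
    ("entertainment", "s3"),
]
summary = summarize(pairs)
print(summary)  # {'business': (3, 2), 'entertainment': (1, 1)}
```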

    References to web pages that contain a link to a news page in the collection are also included. They are represented as pairs of URLs corresponding to 2-page browsing sessions. The collection includes 15516 2-page browsing sessions covering 946 distinct clusters, divided up into:

    6091 2-page sessions for business category
    9425 2-page sessions for entertainment category


    Attribute Information:

    FILENAME #1: newsCorpora.csv (102.297.000 bytes)
    DESCRIPTION: News pages
    FORMAT: ID TITLE URL PUBLISHER CATEGORY STORY HOSTNAME TIMESTAMP

    where:
    ID Numeric ID
    TITLE News title
    URL Url
    PUBLISHER Publisher name
    CATEGORY News category (b = business, t = science and technology, e = entertainment, m = health)
    STORY Alphanumeric ID of the cluster that includes news about the same story
    HOSTNAME Url hostname
    TIMESTAMP Approximate time the news was published, as the number of milliseconds since the epoch 00:00:00 GMT, January 1, 1970
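
The record layout above can be read with the standard library alone. The sketch below assumes the fields are tab-delimited and unquoted (the listing does not show the delimiter), and it parses made-up sample rows rather than real data. The `read_news` helper and the added `PUBLISHED` and `CATEGORY_NAME` keys are illustrative names, not part of the dataset.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timezone

FIELDS = ["ID", "TITLE", "URL", "PUBLISHER", "CATEGORY", "STORY", "HOSTNAME", "TIMESTAMP"]
CATEGORY_NAMES = {
    "b": "business",
    "t": "science and technology",
    "e": "entertainment",
    "m": "health",
}


def read_news(handle):
    """Yield one dict per news page from a file-like object (tab delimiter assumed)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        record = dict(zip(FIELDS, row))
        record["ID"] = int(record["ID"])
        # TIMESTAMP is in milliseconds since the epoch; convert to a UTC datetime.
        millis = int(record["TIMESTAMP"])
        record["PUBLISHED"] = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
        record["CATEGORY_NAME"] = CATEGORY_NAMES.get(record["CATEGORY"], "unknown")
        yield record


# Made-up rows in the documented column order (not taken from the dataset).
SAMPLE = (
    "1\tExample headline one\thttp://news.example.com/a\tExample Publisher"
    "\tb\tstoryA\tnews.example.com\t1394470370698\n"
    "2\tExample headline two\thttp://news.example.org/b\tOther Publisher"
    "\te\tstoryB\tnews.example.org\t1394470371000\n"
)

records = list(read_news(io.StringIO(SAMPLE)))
per_category = Counter(r["CATEGORY_NAME"] for r in records)
print(records[0]["PUBLISHED"].date())  # 2014-03-10
```

For the real file, the same function could be given `open("newsCorpora.csv", encoding="utf-8", newline="")`; the encoding is also an assumption.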


    FILENAME #2: 2pageSessions.csv (3.049.986 bytes)
    DESCRIPTION: 2-page sessions
    FORMAT: STORY HOSTNAME CATEGORY URL

    where:
    STORY Alphanumeric ID of the cluster that includes news about the same story
    HOSTNAME Url hostname
    CATEGORY News category (b = business, t = science and technology, e = entertainment, m = health)
    URL Two space-delimited urls representing a browsing session
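
A session line can be split into its four fields and then into its two URLs. As before, this is a sketch that assumes tab-delimited fields; the `Session` type, the `parse_session` helper, and the sample line are hypothetical. The README does not say which URL comes first, so the two are simply named in order of appearance.

```python
from typing import NamedTuple


class Session(NamedTuple):
    story: str
    hostname: str
    category: str
    first_url: str
    second_url: str


def parse_session(line: str) -> Session:
    """Parse one line of 2pageSessions.csv (tab delimiter assumed)."""
    story, hostname, category, urls = line.rstrip("\n").split("\t")
    parts = urls.split(" ")
    if len(parts) != 2:
        # The URL field is documented as exactly two space-delimited URLs.
        raise ValueError(f"expected 2 URLs, got {len(parts)}")
    return Session(story, hostname, category, parts[0], parts[1])


# Made-up line in the documented column order (not taken from the dataset).
line = "storyA\tnews.example.com\tb\thttp://ref.example.com/x http://news.example.com/a\n"
session = parse_session(line)
print(session.second_url)  # http://news.example.com/a
```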


    Relevant Papers:

    Fabio Gasparetti. 2017. Modeling user interests from web browsing activities. Data Min. Knowl. Discov. 31, 2 (March 2017), 502-547. DOI: [Web link]


    Citation Request:

    Please refer to the Machine Learning Repository's citation policy


    Provided by Artificial Intelligence Lab @ Faculty of Engineering, Roma Tre University - Italy
    Contact: Fabio Gasparetti, Faculty of Engineering, Roma Tre University - Italy (gaspare '@' dia.uniroma3.it)
