Internet,Online Communities,Linguistics,Languages Classification

Data structure: 763.34M

* The analysis above was extracted automatically by the system; the actual data takes precedence.

Context:

“A blog (a truncation of the expression "weblog") is a discussion or informational website published on the World Wide Web consisting of discrete, often informal diary-style text entries ("posts"). Posts are typically displayed in reverse chronological order, so that the most recent post appears first, at the top of the web page. Until 2009, blogs were usually the work of a single individual, occasionally of a small group, and often covered a single subject or topic.” (Wikipedia article “Blog”)

This dataset contains text from blogs written in or before 2004, with each blog being the work of a single user.

Content:

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered in August 2004. The corpus contains a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person.

Each blog is presented as a separate file, the name of which indicates a blogger ID and the blogger’s self-provided gender, age, industry and astrological sign. (All are labeled for gender and age, but for many, industry and/or sign is marked as unknown.)

All bloggers included in the corpus fall into one of three age groups:

* 8,240 "10s" blogs (ages 13-17)
* 8,086 "20s" blogs (ages 23-27)
* 2,994 "30s" blogs (ages 33-47)

For each age group there are an equal number of male and female bloggers.

Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped, with two exceptions: individual posts within a single blog are separated by the date of the following post, and links within a post are denoted by the label urllink.

Acknowledgements:

The corpus may be freely used for non-commercial research purposes. Any resulting publications should cite the following:

J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006). Effects of Age and Gender on Blogging. In Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs.

Inspiration:

* This dataset contains information on writers' demographics, including their age, gender and zodiac sign. Can you build a classifier to guess someone’s zodiac sign from blog posts they’ve written?
* Which are bigger: differences between demographic groups or differences between blogs on different topics?

You may also like:

* News and Blog Data Crawl: Content section from over 160,000 news and blog articles
* 20 Newsgroups: A collection of ~18,000 newsgroup documents from 20 different newsgroups
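The description says each file name encodes a blogger ID, gender, age, industry and astrological sign, but it does not spell out the exact layout. The sketch below assumes a dot-separated pattern of the form `<id>.<gender>.<age>.<industry>.<sign>.xml`; that pattern, the sample file name, and the helper names `parse_blog_filename` and `age_group` are all illustrative assumptions, so check them against the actual files before relying on them. The age ranges and counts come from the description above, and the last lines simply re-derive the "approximately 35 posts per person" figure.

```python
from typing import NamedTuple, Optional

# Age ranges and blog counts as stated in the corpus description.
AGE_GROUPS = {
    "10s": (13, 17, 8240),
    "20s": (23, 27, 8086),
    "30s": (33, 47, 2994),
}


class BloggerInfo(NamedTuple):
    blogger_id: str
    gender: str
    age: int
    industry: str
    sign: str


def parse_blog_filename(name: str) -> BloggerInfo:
    """Split a file name of the ASSUMED form
    '<id>.<gender>.<age>.<industry>.<sign>.xml' into its fields."""
    parts = name.split(".")
    if len(parts) != 6:
        raise ValueError(f"unexpected file name layout: {name!r}")
    blogger_id, gender, age, industry, sign, _extension = parts
    return BloggerInfo(blogger_id, gender, int(age), industry, sign)


def age_group(age: int) -> Optional[str]:
    """Return the corpus age-group label, or None if the age falls in a gap."""
    for label, (low, high, _count) in AGE_GROUPS.items():
        if low <= age <= high:
            return label
    return None


# Sanity-check the headline numbers quoted in the description.
total_blogs = sum(count for _low, _high, count in AGE_GROUPS.values())
posts_per_blogger = 681_288 / total_blogs

if __name__ == "__main__":
    # Hypothetical file name, used only to exercise the parser.
    info = parse_blog_filename("12345.female.25.Student.Leo.xml")
    print(info, age_group(info.age))
    print(total_blogs, round(posts_per_blogger, 1))
```

Returning `None` for ages outside the three ranges makes the gaps between groups (18-22 and 28-32) explicit rather than silently assigning a bucket.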


