Twitter User Gender Classification


Size: 7.8M
Internet,Online Communities,Social Networks,Gender Classification


    README.md

This data set was used to train a CrowdFlower AI gender predictor. [You can read all about the project here](https://www.crowdflower.com/using-machine-learning-to-predict-gender/). Contributors were asked to view a Twitter profile and judge whether the user was a male, a female, or a brand (non-individual). The dataset contains 20,000 rows, each with a user name, a random tweet, account profile and image, location, and even link and sidebar color.

## Inspiration

Here are a few questions you might try to answer with this dataset:

- how well do words in tweets and profiles predict user gender?
- what are the words that strongly predict male or female gender?
- how well do stylistic factors (like link color and sidebar color) predict user gender?

## Acknowledgments

Data was provided by the [Data For Everyone Library](https://www.crowdflower.com/data-for-everyone/) on [CrowdFlower](https://www.crowdflower.com). Our Data for Everyone library is a collection of our favorite open data jobs that have come through our platform. They're available free of charge for the community, forever.

## The Data

The dataset contains the following fields:

- **_unit_id**: a unique id for the user
- **_golden**: whether the user was included in the gold standard for the model; TRUE or FALSE
- **_unit_state**: state of the observation; one of *finalized* (for contributor-judged) or *golden* (for gold standard observations)
- **_trusted_judgments**: number of trusted judgments (int); always 3 for non-golden, and what may be a unique id for gold standard observations
- **_last_judgment_at**: date and time of last contributor judgment; blank for gold standard observations
- **gender**: one of *male*, *female*, or *brand* (for non-human profiles)
- **gender:confidence**: a float representing confidence in the provided gender
- **profile_yn**: "no" here seems to mean that the profile was meant to be part of the dataset but was not available when contributors went to judge it
- **profile_yn:confidence**: confidence in the existence/non-existence of the profile
- **created**: date and time when the profile was created
- **description**: the user's profile description
- **fav_number**: number of tweets the user has favorited
- **gender_gold**: if the profile is golden, what is the gender?
- **link_color**: the link color on the profile, as a hex value
- **name**: the user's name
- **profile_yn_gold**: whether the profile y/n value is golden
- **profileimage**: a link to the profile image
- **retweet_count**: number of times the user has retweeted (or possibly, been retweeted)
- **sidebar_color**: color of the profile sidebar, as a hex value
- **text**: text of a random one of the user's tweets
- **tweet_coord**: if the user has location turned on, the coordinates as a string with the format "[*latitude*, *longitude*]"
- **tweet_count**: number of tweets that the user has posted
- **tweet_created**: when the random tweet (in the **text** column) was created
- **tweet_id**: the tweet id of the random tweet
- **tweet_location**: location of the tweet; seems to not be particularly normalized
- **user_timezone**: the timezone of the user
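
As a rough illustration of how the fields above might be read and cleaned, here is a minimal Python sketch using only the standard library. It is not part of the original project. The rows in `SAMPLE_CSV` are made up for illustration, the helper names (`parse_hex_color`, `parse_coord`, `load_rows`) are hypothetical, and two things are assumptions: that available profiles carry the value "yes" in **profile_yn**, and that hex colors are stored as six hex digits with an optional leading `#`. The 0.9 confidence threshold is arbitrary.

```python
import csv
import io
import string

# Made-up sample rows using a subset of the documented columns (not real data).
SAMPLE_CSV = """_unit_id,gender,gender:confidence,profile_yn,link_color,text,tweet_coord
1,male,1.0,yes,08C2C2,hello world,"[40.7, -74.0]"
2,female,0.66,yes,0084B4,good morning,
3,brand,1.0,yes,ABB8C2,big sale today,
4,,0.0,no,0084B4,,
"""

def parse_hex_color(value):
    """Convert a 6-digit hex color such as '08C2C2' to (r, g, b), else None."""
    value = value.strip().lstrip("#")
    if len(value) != 6 or any(c not in string.hexdigits for c in value):
        return None
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))

def parse_coord(value):
    """Parse a '[latitude, longitude]' string into two floats, else None."""
    value = value.strip()
    if not (value.startswith("[") and value.endswith("]")):
        return None
    parts = value[1:-1].split(",")
    if len(parts) != 2:
        return None
    try:
        return float(parts[0]), float(parts[1])
    except ValueError:
        return None

def load_rows(handle, min_confidence=0.9):
    """Yield cleaned rows whose gender label meets the confidence threshold."""
    for row in csv.DictReader(handle):
        # Assumption: "yes" marks a profile that was available to contributors.
        if row["profile_yn"] != "yes":
            continue
        if row["gender"] not in {"male", "female", "brand"}:
            continue
        try:
            confidence = float(row["gender:confidence"])
        except ValueError:
            continue
        if confidence < min_confidence:
            continue
        yield {
            "unit_id": row["_unit_id"],
            "gender": row["gender"],
            "confidence": confidence,
            "text": row["text"],
            "link_rgb": parse_hex_color(row["link_color"]),
            "coord": parse_coord(row["tweet_coord"]),
        }

rows = list(load_rows(io.StringIO(SAMPLE_CSV)))
for r in rows:
    print(r["unit_id"], r["gender"], r["link_rgb"], r["coord"])
```

To use this on the actual file, pass an open file handle to `load_rows` instead of the in-memory sample; the file's encoding may need to be specified explicitly when opening it.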