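This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains about 0.5 million messages in total. The data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.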
Research uses of the dataset
A paper describing the Enron data was presented at the 2004 CEAS conference.
Some experiments associated with this data are described on Ron Bekkerman's home page.
A social-network analysis of the data, including "useful mappings between the MD5 digest of the email bodies and such things as authors, recipients, etc", is available from Andres Corrada-Emmanuel.
A group from SIMS, UC Berkeley provides search, visualization, and some email that has been labeled with topic and sentiment.
Jitesh Shetty has put up a database of link-analysis results.
A version of the dataset with all attachments is available from EDRM.
Work at the University of Pennsylvania includes a query dataset for email search as well as a tool for generating spelling errors based on the Enron corpus.
Kimmie Farrington and colleagues published a paper in 2011 that uses the Enron dataset as part of the test corpus for their work on crowdsourcing human- vs. computer-generated classification explanations; see Hutton, Amanda, Alexander Liu, and Cheryl Martin. "Crowdsourcing evaluations of classifier interpretability." In Proceedings of the 2012 AAAI Spring Symposium on Wisdom of the Crowd.
Parakweet has released an open source set of Enron sentence data, labeled for speech acts.
A set of sentence-level annotations (marking what requires action or a response from the user) has been released by Charlie Oxborough.
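
The social-network analysis listed above mentions mappings between the MD5 digest of an email body and its authors and recipients. As a rough illustration of that idea, the sketch below builds such a mapping in Python with the standard `email` and `hashlib` modules. It assumes single-part, plain-text messages and hashes the body exactly as parsed; the function name `body_digest_index`, the sample messages, and the choice of UTF-8 encoding are hypothetical, and the normalization used in the published mappings may well differ.

```python
import hashlib
from email import message_from_string


def body_digest_index(raw_messages):
    """Map the MD5 hex digest of each message body to its sender and recipients.

    Assumption: each message is single-part plain text, so get_payload()
    returns the body as a string. No whitespace normalization is applied.
    """
    index = {}
    for raw in raw_messages:
        msg = message_from_string(raw)
        body = msg.get_payload()
        digest = hashlib.md5(body.encode("utf-8")).hexdigest()
        # Split the To header on commas; real headers may need email.utils.getaddresses.
        recipients = [a.strip() for a in (msg.get("To") or "").split(",") if a.strip()]
        index[digest] = {"from": msg.get("From"), "to": recipients}
    return index


# Hypothetical sample messages, not taken from the corpus.
SAMPLE_MESSAGES = [
    "From: alice@example.com\nTo: bob@example.com, carol@example.com\n"
    "Subject: Schedule\n\nMeeting at noon.\n",
    "From: bob@example.com\nTo: alice@example.com\n"
    "Subject: Re: Schedule\n\nSounds good.\n",
]

index = body_digest_index(SAMPLE_MESSAGES)
for digest, info in index.items():
    print(digest, info["from"], info["to"])
```

Keying on the body digest means identical bodies stored in several folders collapse to a single entry, which is one plausible reason to index messages this way.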
