Data structure: 716.62M
Overview
This dataset contains nearly every opinion authored by the Supreme Court of the United States (SCOTUS) from its inception in 1789 until 2020. There are roughly 36,000 of them, authored by 98 different justices.
SCOTUS opinions are a pretty interesting dataset from a number of different perspectives, and there’s a lot you can do with them (concept extraction, ideology analysis, change over time, author prediction, &c.). And given their role in shaping public and private life in the US, the more light shed on them the better.
Why this dataset is necessary
I originally compiled these for an NLP project which you can see as the first uploaded kernel here ("[Preliminary analysis and topic modeling][1]"). The gathering and cleaning required to get a workable dataset took the better part of a week – providing the impetus to share the results here.
All SCOTUS opinions are publicly available on numerous platforms (Findlaw, Justia, Courtlistener, etc.). But it seems that no free resource provides the sort of well-formatted bulk downloads that any comprehensive study requires. Manually copying and pasting tens of thousands of opinions from these resources is prohibitively time-consuming.
This dataset gives users a simple and exhaustive CSV with the text of each opinion and all associated metadata (author, date, case, &c.).
How this dataset was produced
Courtlistener.com is the only site I found that has [an API][2] and [bulk downloads of Opinion jsons][3]**. Unfortunately, these Opinion jsons present many difficulties – most notably, they do not actually contain either the author name or the text of the opinions as such.
What the courtlistener.com Opinion jsons do contain is the html of a full record of every case heard by SCOTUS. Each record is a single html div of text containing the title, all opinions submitted for the case (majority, dissenting, concurring), the texts of those opinions, and footnotes. None of these features are in separate divs or have distinguishing class labels.
To produce this dataset, I downloaded the Opinion jsons and joined them with Opinion Cluster jsons (which contain structured data about the case, such as the date filed and the name of the plaintiff). I then wrote a fairly elaborate set of rules for parsing the Opinion html into (author / majority-vs-dissenting-vs-concurring / opinion-text) clusters with 98+% accuracy. Combining this information with the meta-information from the Opinion Cluster jsons produced the dataset you see here.
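The join step described above can be sketched roughly as follows. This is only an illustration, not the included script: the field names `cluster_id`, `id`, `html`, `case_name` and `date_filed`, as well as the sample records, are assumptions made for the example and are not taken from the Courtlistener schema.

```python
# Hypothetical, simplified records standing in for parsed Opinion and
# Opinion Cluster jsons. Field names and values are illustrative only.
opinions = [
    {"cluster_id": 101, "html": "<div>Opinion text A</div>"},
    {"cluster_id": 102, "html": "<div>Opinion text B</div>"},
    {"cluster_id": 999, "html": "<div>Opinion with no cluster</div>"},
]
clusters = [
    {"id": 101, "case_name": "Doe v. Roe", "date_filed": "1901-03-04"},
    {"id": 102, "case_name": "Smith v. Jones", "date_filed": "1955-06-01"},
]


def join_opinions_with_clusters(opinions, clusters):
    """Attach cluster metadata to each opinion (inner join on cluster id)."""
    clusters_by_id = {c["id"]: c for c in clusters}
    joined = []
    for op in opinions:
        cluster = clusters_by_id.get(op["cluster_id"])
        if cluster is None:
            continue  # no matching cluster record, so no metadata to attach
        joined.append({
            "case_name": cluster["case_name"],
            "date_filed": cluster["date_filed"],
            "html": op["html"],
        })
    return joined


records = join_opinions_with_clusters(opinions, clusters)
```

Indexing the clusters by id first keeps the join to a single pass over the opinions. The html still has to be parsed into author, opinion type and text afterwards.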
The complete loading and cleaning script is included here so that you can take a look for yourself and make improvements if you want. (Apologies for any ungainliness – I’d hoped this would be a simpler process, so the code probably isn't as well structured as it could be.)
Caveats
1. While I did my best to hash out reliable text-parsing rules and check the results, this dataset is not 100% clean or complete. There are some kinks in original sources, and I'm sure I've inadvertently added one or two. Treat the data with your usual due diligence. If you need more certainty, look at the linked full-text pages in the absolute_url field, and/or see what you can do to improve the cleaning functions.
2. Speaking of which: if you do dig into the loading and cleaning functions, be prepared for some tangles: they’re pretty finicky and can be a headache to debug. If you can see better ways to do the same thing, I’d be eager to hear your ideas.
3. I *highly* recommend eliminating all author names with fewer than five attributed opinions for most analyses. (The two lines of code that do so are at the very bottom of the cleaning function, commented out.) Some of these are just very short-tenured justices, but others are typos or mislabels in the original data. I’ve left all of them in so that the user has the choice.
4. Consider removing all records where the federal_cite_one identifier is duplicated but case_name is distinct. In many cases, one or both records represent a simple procedural event (appeal status determined, etc.).
5. Some author names pre-1970 include a variation: a differently formatted apostrophe, misspelling, etc. (I’ve manually cleaned the post-1970 authors.) You may want to look at unique author_name values and clean as necessary.
6. If you have doubts or questions about an opinion, just follow the absolute_url to read the full case page.
7. There are a small number of outlier opinions that are extremely long. These are nonetheless real. (I guess even SCOTUS justices get carried away sometimes.)
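Caveats 3, 4 and 5 can be applied with a few small filters. The sketch below uses only the Python standard library and made-up rows. It is not the commented-out code from the cleaning function, and the helper names (`normalize_author`, `drop_rare_authors`, `drop_ambiguous_cites`) are hypothetical. The column names `author_name`, `federal_cite_one` and `case_name` are the ones mentioned above.

```python
from collections import Counter, defaultdict


def normalize_author(name):
    """Unify curly apostrophes and trim whitespace (caveat 5)."""
    return name.replace("\u2019", "'").replace("\u2018", "'").strip()


def drop_rare_authors(rows, min_opinions=5):
    """Keep only authors with at least min_opinions opinions (caveat 3)."""
    counts = Counter(r["author_name"] for r in rows)
    return [r for r in rows if counts[r["author_name"]] >= min_opinions]


def drop_ambiguous_cites(rows):
    """Drop rows whose citation maps to several distinct case names (caveat 4)."""
    names_by_cite = defaultdict(set)
    for r in rows:
        if r["federal_cite_one"]:
            names_by_cite[r["federal_cite_one"]].add(r["case_name"])
    return [
        r for r in rows
        # Rows without a citation are kept; they cannot be compared.
        if not r["federal_cite_one"]
        or len(names_by_cite[r["federal_cite_one"]]) == 1
    ]


# Made-up sample rows; the citations and case names are not real.
rows = [
    {"author_name": "O\u2019Connor", "federal_cite_one": "500 U.S. 1", "case_name": "A v. B"},
    {"author_name": "O'Connor", "federal_cite_one": "500 U.S. 2", "case_name": "C v. D"},
    {"author_name": "O'Connor", "federal_cite_one": "500 U.S. 3", "case_name": "E v. F"},
    {"author_name": "O'Connor", "federal_cite_one": "500 U.S. 4", "case_name": "G v. H"},
    {"author_name": "O'Connor", "federal_cite_one": "500 U.S. 5", "case_name": "I v. J"},
    {"author_name": "O'Connor", "federal_cite_one": "500 U.S. 5", "case_name": "I v. J (stay denied)"},
    {"author_name": "Oconnor", "federal_cite_one": "500 U.S. 6", "case_name": "K v. L"},
]

cleaned = [{**r, "author_name": normalize_author(r["author_name"])} for r in rows]
kept = drop_rare_authors(cleaned)
final = drop_ambiguous_cites(kept)
```

Normalizing names before counting matters: otherwise a variant spelling splits one author's opinions across two names and can push a legitimate author under the threshold.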
Recommendations
1. In the interest of preserving maximum information, this dataset labels first dissenting opinions as “dissenting” and any subsequent dissenting opinion in a case file as “second_dissenting”. For many analyses this distinction won’t be important – especially if you drop very short opinions (see 3 below) – and they can be relabeled to just “dissenting”.
2. “Per curiam” opinions – opinions the Court determined to be so straightforward and uncontroversial as to warrant no very detailed opinion or named author – are quite different from other opinions; most general analyses would probably do well to sequester them.
3. For qualitative analysis of some kinds, it may also be helpful to exclude opinions that are very short (I’ve found 3,000 characters to be a good threshold). These are mostly concurring or dissenting opinions that merely constitute brief commentary on the majority opinions.
4. As you can see in the kernel I’ve uploaded here, I found that K-means or agglomerative clustering, followed by LDA, produced the best semantic clusters. Those techniques may prove a good starting point for playing around with the dataset.
5. Again, see caveat 3 above regarding dropping very rare author names.
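Recommendations 1 to 3 amount to a short preprocessing pass, sketched below with the standard library only. The column names `category` and `text`, the `per_curiam` label, and the function name `prepare_for_analysis` are assumptions made for the example. Only the labels “dissenting” and “second_dissenting” and the 3,000-character threshold come from the text above.

```python
SHORT_OPINION_CHARS = 3000  # threshold suggested in recommendation 3


def prepare_for_analysis(rows, min_chars=SHORT_OPINION_CHARS):
    """Return (prepared, per_curiam) lists of rows.

    Per curiam rows are set aside, very short opinions are dropped, and
    "second_dissenting" is relabeled to "dissenting".
    """
    prepared, per_curiam = [], []
    for r in rows:
        if r["category"] == "per_curiam":  # assumed label, see lead-in
            per_curiam.append(r)
            continue
        if len(r["text"]) < min_chars:
            continue  # brief commentary on the majority opinion
        category = r["category"]
        if category == "second_dissenting":
            category = "dissenting"
        prepared.append({**r, "category": category})
    return prepared, per_curiam


# Made-up rows; the text is filler of the stated length.
sample = [
    {"category": "majority", "text": "x" * 4000},
    {"category": "second_dissenting", "text": "x" * 3500},
    {"category": "concurring", "text": "x" * 200},
    {"category": "per_curiam", "text": "x" * 5000},
]

prepared, per_curiam = prepare_for_analysis(sample)
```

Returning the per curiam opinions as a separate list, rather than discarding them, keeps them available for any analysis that does want them.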
[1]: https://www.kaggle.com/gqfiddler/preliminary-analysis-and-topic-modeling
[2]: https://www.courtlistener.com/api/rest-info/
[3]: https://www.courtlistener.com/api/bulk-data/
** As of February 2020, all SCOTUS opinions are contained in the file scotus.tar.gz on the bulk downloads page.