# Dataset: The files on your computer.
Crab is a command line tool for Mac and Windows that scans file data into a SQLite database, so you can run SQL queries over it.
e.g. (Win) C:> crab C:\some\path\MyProject
or (Mac) $ crab /some/path/MyProject
You get a CRAB> prompt where you can enter SQL queries on the data.
e.g. Count files by extension
SELECT extension, count(*)
FROM files
GROUP BY extension;
e.g. List the 5 biggest directories
SELECT parentpath, sum(bytes)/1e9 as GB
FROM files
GROUP BY parentpath
ORDER BY sum(bytes) DESC LIMIT 5;
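To try the two queries above without Crab, the sketch below runs them through Python's built-in sqlite3 module against a tiny in-memory stand-in for the files table. The sample rows, the reduced column set, and the function names are invented for illustration; only the column names come from this README.

```python
import sqlite3

# Hypothetical sample rows: (name, bytes, extension, type, parentpath)
SAMPLE_ROWS = [
    ("a.txt", 1000, ".txt", "f", "/data/books/"),
    ("b.txt", 3000, ".txt", "f", "/data/books/"),
    ("c.zip", 9000, ".zip", "f", "/data/archives/"),
    ("notes.md", 500, ".md", "f", "/data/"),
]


def make_sample_db():
    """Build an in-memory table with a subset of the documented columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files (name TEXT, bytes INT, extension TEXT, "
        "type TEXT, parentpath TEXT)"
    )
    conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?, ?)", SAMPLE_ROWS)
    return conn


def count_by_extension(conn):
    """Same grouping as the first query, sorted for a stable result."""
    return conn.execute(
        "SELECT extension, count(*) FROM files "
        "GROUP BY extension ORDER BY extension"
    ).fetchall()


def biggest_directories(conn, limit=5):
    """Same grouping as the second query, reporting raw bytes instead of GB."""
    return conn.execute(
        "SELECT parentpath, sum(bytes) AS total FROM files "
        "GROUP BY parentpath ORDER BY total DESC LIMIT ?",
        (limit,),
    ).fetchall()
```

Note that grouping by parentpath totals the items directly inside each directory, not the whole subtree below it.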
Crab provides a virtual table, fileslines, which exposes file contents to SQL.
e.g. Count TODO and FIXME entries in any .c files, recursively
SELECT fullpath, count(*) FROM fileslines
WHERE parentpath like '/Users/GN/HL3/%' and extension = '.c'
and (data like '%TODO%' or data like '%FIXME%')
GROUP BY fullpath;
There are also functions to run programs or shell commands on any subset of files, or on lines within files.
e.g. (Mac) Unzip all the .zip files, recursively
SELECT exec('unzip', '-n', fullpath, '-d', '/Users/johnsmith/Target Dir/')
FROM files
WHERE parentpath like '/Users/johnsmith/Source Dir/%' and extension = '.zip';
(Here -n tells _unzip_ not to overwrite anything, and -d specifies the target directory.)
There is also a function to write query output to a file.
e.g. (Win) Sort the lines of all the .txt files in a directory and write them to a new file
SELECT writeln('C:\Users\SJohnson\dictionary2.txt', data)
FROM fileslines
WHERE parentpath = 'C:\Users\SJohnson\' and extension = '.txt'
ORDER BY data;
In place of the interactive prompt you can run queries in batch mode. E.g. here is a one-liner that returns the full path of every file in the current directory
C:> crab -batch -maxdepth 1 . "SELECT fullpath FROM files"
Crab SQL can also be used in Windows batch files, or Bash scripts, e.g. for ETL processing.
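As a rough illustration of scripted use, here is a minimal Python wrapper around the batch one-liner above. It assumes the crab executable is on your PATH. The flags and their order are copied from that one-liner; omitting -maxdepth when no depth is given is an assumption, and the function names are hypothetical. The README does not describe the batch output format, so the output is returned unparsed.

```python
import subprocess


def build_crab_batch_command(root, query, maxdepth=None):
    """Assemble the argument list for a Crab batch run.

    Mirrors: crab -batch -maxdepth 1 . "SELECT fullpath FROM files"
    Leaving out -maxdepth when maxdepth is None is an assumption.
    """
    cmd = ["crab", "-batch"]
    if maxdepth is not None:
        cmd += ["-maxdepth", str(maxdepth)]
    cmd += [root, query]
    return cmd


def run_crab_batch(root, query, maxdepth=None):
    """Run the query and return Crab's standard output as raw text."""
    cmd = build_crab_batch_command(root, query, maxdepth)
    # check=True raises CalledProcessError if crab exits with an error code.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `run_crab_batch(".", "SELECT fullpath FROM files", maxdepth=1)` corresponds to the one-liner shown above.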
**Crab is free for personal use, $5/mo commercial**
See more details here (Mac): [http://etia.co.uk/][1] or here (Win): [http://etia.co.uk/win/about/][2]
An example SQLite database (Mac data) has been uploaded for you to play with. It includes an example files table for the directory tree you get when downloading the Project Gutenberg corpus, which contains 95k directories and 123k files.
To scan your own files and get access to the virtual tables and support functions, you have to use the Crab SQLite shell, available for download from this page (Mac): [http://etia.co.uk/download/][3] or this page (Win): [http://etia.co.uk/win/download/][4]
# Content
FILES TABLE
The FILES table contains details of every item scanned, file or directory. All columns are indexed except 'mode'.
COLUMNS
fileid (int) primary key -- files table row number, a unique id for each item
name (text) -- item name e.g. 'Hei.ttf'
bytes (int) -- item size in bytes e.g. 7502752
depth (int) -- how far scan recursed to find the item, starts at 0
accessed (text) -- datetime item was accessed
modified (text) -- datetime item was modified
basename (text) -- item name without path or extension, e.g. 'Hei'
extension (text) -- item extension including the dot, e.g. '.ttf'
type (text) -- item type, 'f' for file or 'd' for directory
mode (text) -- further type info and permissions, e.g. 'drwxr-xr-x'
parentpath (text) -- absolute path of directory containing the item, e.g. '/Library/Fonts/'
fullpath (text) unique -- parentpath of the item concatenated with its name, e.g. '/Library/Fonts/Hei.ttf'
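For experimenting with a table of the same shape, the column list above can be written as SQLite DDL. This is only a sketch: Crab's actual CREATE TABLE statement is not shown in this README, and the index names and the function name are made up.

```python
import sqlite3

# Column names and types follow the list above; the exact DDL is an assumption.
SCHEMA = """
CREATE TABLE files (
    fileid     INTEGER PRIMARY KEY,
    name       TEXT,
    bytes      INTEGER,
    depth      INTEGER,
    accessed   TEXT,
    modified   TEXT,
    basename   TEXT,
    extension  TEXT,
    type       TEXT,
    mode       TEXT,
    parentpath TEXT,
    fullpath   TEXT UNIQUE
);
"""

# fileid and fullpath get implicit indexes from PRIMARY KEY and UNIQUE;
# 'mode' is left unindexed, as documented.
INDEXED_COLUMNS = [
    "name", "bytes", "depth", "accessed", "modified",
    "basename", "extension", "type", "parentpath",
]


def create_files_table(conn):
    """Create the sketch table plus one index per indexed column."""
    conn.executescript(SCHEMA)
    for col in INDEXED_COLUMNS:
        # Hypothetical index names; Crab's own names are not documented here.
        conn.execute(f"CREATE INDEX idx_files_{col} ON files({col})")
```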
PATHS
1) parentpath and fullpath don't support abbreviations such as '~', '.' or '..'. They're just strings.
2) Directory paths all have a '/' on the end.
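Because paths are plain strings with a trailing '/', "directly inside a directory" is an equality test on parentpath, and "anywhere below a directory" is a prefix match. The sketch below shows both against any SQLite connection that holds a files table. The function names are hypothetical, and the ESCAPE clause is an addition that the README's own examples do not use: it keeps '%' and '_' in real path names from acting as LIKE wildcards.

```python
import sqlite3


def escape_like(prefix):
    """Escape LIKE wildcards so a path prefix is matched literally."""
    return (prefix.replace("\\", "\\\\")
                  .replace("%", "\\%")
                  .replace("_", "\\_"))


def items_directly_in(conn, directory):
    """Items whose containing directory is exactly `directory`."""
    rows = conn.execute(
        "SELECT fullpath FROM files WHERE parentpath = ? ORDER BY fullpath",
        (directory,),
    )
    return [r[0] for r in rows]


def items_under(conn, directory):
    """Items anywhere below `directory`, matched as a string prefix."""
    rows = conn.execute(
        "SELECT fullpath FROM files "
        "WHERE parentpath LIKE ? ESCAPE '\\' ORDER BY fullpath",
        (escape_like(directory) + "%",),
    )
    return [r[0] for r in rows]
```

Pass the directory with its trailing '/', e.g. `items_under(conn, '/Library/Fonts/')`, so that a sibling such as '/Library/Fonts2/' is not matched. SQLite's LIKE is case-insensitive for ASCII letters by default, which matters on case-sensitive filesystems.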
FILESLINES TABLE
The FILESLINES table is for querying data content of files. It has line number and data columns, with one row for each line of data in each file scanned by Crab.
This table isn't available in the example dataset, because it's a virtual table and doesn't physically contain data.
COLUMNS
linenumber (int) -- line number within file, restarts count from 1 at the first line of each file
data (text) -- data content of the files, one entry for each line
FILESLINES also duplicates the columns of the FILES table: fileid, name, bytes, depth, accessed, modified, basename, extension, type, mode, parentpath, and fullpath. This way you can restrict which files are searched without having to join tables.
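To get a feel for the shape of fileslines without the Crab shell, the sketch below fills an ordinary SQLite table with one row per line of each file under a directory, then runs a TODO/FIXME count like the earlier example. This is an emulation, not how Crab implements its virtual table. It copies only a few of the duplicated columns, it uses the operating system's path separator, and the function names are hypothetical.

```python
import os
import sqlite3


def build_fileslines(conn, root):
    """Populate a plain table shaped like a small subset of fileslines."""
    conn.execute(
        "CREATE TABLE fileslines (linenumber INT, data TEXT, "
        "name TEXT, extension TEXT, parentpath TEXT, fullpath TEXT)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        # Joining with "" appends the trailing separator used by directory paths.
        parentpath = os.path.join(os.path.abspath(dirpath), "")
        for name in filenames:
            fullpath = parentpath + name
            extension = os.path.splitext(name)[1]
            # Undecodable bytes are replaced so binary files do not abort the scan.
            with open(fullpath, encoding="utf-8", errors="replace") as fh:
                rows = [
                    (n, line.rstrip("\n"), name, extension, parentpath, fullpath)
                    for n, line in enumerate(fh, start=1)  # numbering restarts per file
                ]
            conn.executemany(
                "INSERT INTO fileslines VALUES (?, ?, ?, ?, ?, ?)", rows
            )


def count_markers(conn, extension=".c"):
    """Count lines containing TODO or FIXME, per file with the given extension."""
    return conn.execute(
        "SELECT fullpath, count(*) FROM fileslines "
        "WHERE extension = ? "
        "AND (data LIKE '%TODO%' OR data LIKE '%FIXME%') "
        "GROUP BY fullpath ORDER BY fullpath",
        (extension,),
    ).fetchall()
```

Unlike a virtual table, this copies every line into the database, so it is only practical for small directory trees.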
# Example Gutenberg data
An example SQLite database (Mac data), _database.sqlite_, has been uploaded for you to play with. It includes an example _files_ table for the directory tree you get when downloading the Project Gutenberg corpus, which contains 95k directories and 123k files.
You can open it with any SQLite shell, or query it with any SQLite query tool, but the virtual tables such as _fileslines_ and support functions such as EXEC() and WRITELN() only work from the Crab shell, which you have to download from etia.co.uk.
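As one example of a generic SQLite tool, Python's standard sqlite3 module can read the example database. The sketch below assumes the file has been saved locally as database.sqlite and opens it read-only; the function name and the choice of query are illustrative.

```python
import sqlite3
from pathlib import Path


def top_extensions(db_path="database.sqlite", limit=10):
    """Return the most common file extensions in the files table."""
    # A file: URI with mode=ro opens the database read-only,
    # so the downloaded file cannot be modified by accident.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        return conn.execute(
            "SELECT extension, count(*) AS n FROM files "
            "WHERE type = 'f' "
            "GROUP BY extension ORDER BY n DESC, extension LIMIT ?",
            (limit,),
        ).fetchall()
    finally:
        conn.close()
```

Only the ordinary files table is used here, so the query does not depend on anything specific to the Crab shell.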
# Uses
* Reporting and analysis of filesystem contents
* Finding files and directories
* Filesystem operations such as moving, copying, deleting, and unzipping files
* ETL processing
[1]: http://etia.co.uk/
[2]: http://etia.co.uk/win/about/
[3]: http://etia.co.uk/download/
[4]: http://etia.co.uk/win/download/