The files on your computer

Size: 103.49M

Business,Computer Science,Software,Programming Classification


README.md

# Dataset: The files on your computer

Crab is a command line tool for Mac and Windows that scans file data into a SQLite database, so you can run SQL queries over it.

e.g. (Win)

    C:> crab C:\some\path\MyProject

or (Mac)

    $ crab /some/path/MyProject

You get a CRAB> prompt where you can enter SQL queries on the data.

e.g. Count files by extension

    SELECT extension, count(*)
    FROM files
    GROUP BY extension;

e.g. List the 5 biggest directories

    SELECT parentpath, sum(bytes)/1e9 as GB
    FROM files
    GROUP BY parentpath
    ORDER BY sum(bytes) DESC
    LIMIT 5;

Crab provides a virtual table, fileslines, which exposes file contents to SQL.

e.g. Count TODO and FIXME entries in any .c files, recursively

    SELECT fullpath, count(*)
    FROM fileslines
    WHERE parentpath like '/Users/GN/HL3/%' and extension = '.c'
      and (data like '%TODO%' or data like '%FIXME%')
    GROUP BY fullpath;

There are also functions to run programs or shell commands on any subset of files, or on lines within files.

e.g. (Mac) unzip all the .zip files, recursively

    SELECT exec('unzip', '-n', fullpath, '-d', '/Users/johnsmith/Target Dir/')
    FROM files
    WHERE parentpath like '/Users/johnsmith/Source Dir/%' and extension = '.zip';

(Here -n tells _unzip_ not to overwrite anything, and -d specifies the target directory.)

There is also a function to write query output to a file.

e.g. (Win) Sort the lines of all the .txt files in a directory and write them to a new file

    SELECT writeln('C:\Users\SJohnson\dictionary2.txt', data)
    FROM fileslines
    WHERE parentpath = 'C:\Users\SJohnson\' and extension = '.txt'
    ORDER BY data;

In place of the interactive prompt you can run queries in batch mode. e.g. Here is a one-liner that returns the full path of all the files in the current directory

    C:> crab -batch -maxdepth 1 . "SELECT fullpath FROM files"

Crab SQL can also be used in Windows batch files or Bash scripts, e.g. for ETL processing.

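
The first two example queries do not depend on the Crab shell, so they can be tried from any SQLite client. Below is a minimal sketch using Python's standard `sqlite3` module. The in-memory table, the sample rows, and the helper names (`build_demo_db`, `count_by_extension`, `biggest_directories`) are assumptions made for illustration, not part of Crab. An `ORDER BY extension` is added to the first query only to make the output order deterministic.

```python
import sqlite3

# Hypothetical rows standing in for a real Crab scan:
# (parentpath, name, extension, bytes)
SAMPLE_ROWS = [
    ("/proj/src/", "main.c", ".c", 4_000),
    ("/proj/src/", "util.c", ".c", 2_500),
    ("/proj/src/", "util.h", ".h", 500),
    ("/proj/docs/", "guide.txt", ".txt", 12_000),
    ("/proj/assets/", "logo.png", ".png", 90_000),
]


def build_demo_db():
    """Create an in-memory table holding only the columns these queries use."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files (parentpath TEXT, name TEXT, extension TEXT, bytes INTEGER)"
    )
    conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
    return conn


def count_by_extension(conn):
    """Count files by extension (sorted by extension for stable output)."""
    sql = "SELECT extension, count(*) FROM files GROUP BY extension ORDER BY extension"
    return conn.execute(sql).fetchall()


def biggest_directories(conn, limit=5):
    """List the directories holding the most bytes, with sizes in GB."""
    sql = (
        "SELECT parentpath, sum(bytes)/1e9 AS GB FROM files "
        "GROUP BY parentpath ORDER BY sum(bytes) DESC LIMIT ?"
    )
    return conn.execute(sql, (limit,)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(count_by_extension(conn))
    print(biggest_directories(conn))
```

To run the same functions against the uploaded example database instead of the toy table, replace `build_demo_db()` with `sqlite3.connect("database.sqlite")`, assuming the file is in the working directory.
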
**Crab is free for personal use, $5/mo commercial**

See more details here (Mac): [http://etia.co.uk/][1] or here (Win): [http://etia.co.uk/win/about/][2]

An example SQLite database (Mac data) has been uploaded for you to play with. It includes an example files table for the directory tree you get when downloading the Project Gutenberg corpus, which contains 95k directories and 123k files.

To scan your own files and get access to the virtual tables and support functions, you have to use the Crab SQLite shell, available for download from this page (Mac): [http://etia.co.uk/download/][3] or this page (Win): [http://etia.co.uk/win/download/][4]

# Content

FILES TABLE

The FILES table contains details of every item scanned, file or directory. All columns are indexed except 'mode'.

COLUMNS

* fileid (int) primary key -- files table row number, a unique id for each item
* name (text) -- item name e.g. 'Hei.ttf'
* bytes (int) -- item size in bytes e.g. 7502752
* depth (int) -- how far scan recursed to find the item, starts at 0
* accessed (text) -- datetime item was accessed
* modified (text) -- datetime item was modified
* basename (text) -- item name without path or extension, e.g. 'Hei'
* extension (text) -- item extension including the dot, e.g. '.ttf'
* type (text) -- item type, 'f' for file or 'd' for directory
* mode (text) -- further type info and permissions, e.g. 'drwxr-xr-x'
* parentpath (text) -- absolute path of directory containing the item, e.g. '/Library/Fonts/'
* fullpath (text) unique -- parentpath of the item concatenated with its name, e.g. '/Library/Fonts/Hei.ttf'

PATHS

1) parentpath and fullpath don't support abbreviations such as ~ . or .. They're just strings.
2) Directory paths all have a '/' on the end.

FILESLINES TABLE

The FILESLINES table is for querying data content of files. It has line number and data columns, with one row for each line of data in each file scanned by Crab.
This table isn't available in the example dataset, because it's a virtual table and doesn't physically contain data.

COLUMNS

* linenumber (int) -- line number within file, restarts count from 1 at the first line of each file
* data (text) -- data content of the files, one entry for each line

FILESLINES also duplicates the columns of the FILES table: fileid, name, bytes, depth, accessed, modified, basename, extension, type, mode, parentpath, and fullpath. This way you can restrict which files are searched without having to join tables.

# Example Gutenberg data

An example SQLite database (Mac data), _database.sqlite_, has been uploaded for you to play with. It includes an example _files_ table for the directory tree you get when downloading the Project Gutenberg corpus, which contains 95k directories and 123k files.

You can open it with any SQLite shell, or query it with any SQLite query tool, but the virtual tables such as _fileslines_ and support functions such as EXEC() and WRITELN() only work from the Crab shell, which you have to download from etia.co.uk.

# Uses

* Reporting and analysis of filesystem contents
* Finding files and directories
* Filesystem operations such as moving, copying, deleting, unzipping files
* ETL processing

[1]: http://etia.co.uk/
[2]: http://etia.co.uk/win/about/
[3]: http://etia.co.uk/download/
[4]: http://etia.co.uk/win/download/
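
The FILES table layout and the path rules described above (paths are plain strings, directory paths end in '/') can be illustrated with a small sketch in Python's standard `sqlite3` module. The column names and types follow the README. The rows, the `mode` values, and the helper names (`build_files_db`, `children_of`, `descendants_of`) are hypothetical, and `accessed`/`modified` are left NULL because the README does not specify the datetime format. The README states that the columns are indexed; this sketch creates no indexes beyond the primary key and the unique constraint.

```python
import sqlite3

# Columns as documented for the FILES table.
SCHEMA = """
CREATE TABLE files (
    fileid     INTEGER PRIMARY KEY,
    name       TEXT,
    bytes      INTEGER,
    depth      INTEGER,
    accessed   TEXT,
    modified   TEXT,
    basename   TEXT,
    extension  TEXT,
    type       TEXT,
    mode       TEXT,
    parentpath TEXT,
    fullpath   TEXT UNIQUE
)
"""

# Hypothetical file rows; only the first reuses the README's own example values.
ROWS = [
    (1, "Hei.ttf", 7502752, 1, None, None, "Hei", ".ttf", "f",
     "-rw-r--r--", "/Library/Fonts/", "/Library/Fonts/Hei.ttf"),
    (2, "Kai.ttf", 100, 1, None, None, "Kai", ".ttf", "f",
     "-rw-r--r--", "/Library/Fonts/", "/Library/Fonts/Kai.ttf"),
    (3, "readme.txt", 50, 2, None, None, "readme", ".txt", "f",
     "-rw-r--r--", "/Library/Fonts/Extra/", "/Library/Fonts/Extra/readme.txt"),
    (4, "notes.txt", 10, 1, None, None, "notes", ".txt", "f",
     "-rw-r--r--", "/Users/demo/", "/Users/demo/notes.txt"),
]


def build_files_db():
    """Create an in-memory FILES table and load the hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO files VALUES (?,?,?,?,?,?,?,?,?,?,?,?)", ROWS)
    return conn


def children_of(conn, directory):
    """Files directly inside `directory`; the path must end in '/'."""
    sql = ("SELECT name, bytes FROM files "
           "WHERE type = 'f' AND parentpath = ? ORDER BY name")
    return conn.execute(sql, (directory,)).fetchall()


def descendants_of(conn, directory):
    """Files anywhere below `directory`, matched by string prefix with LIKE.

    Note: '%' and '_' inside `directory` would act as LIKE wildcards.
    """
    sql = ("SELECT fullpath FROM files "
           "WHERE type = 'f' AND parentpath LIKE ? ORDER BY fullpath")
    return [row[0] for row in conn.execute(sql, (directory + "%",))]


if __name__ == "__main__":
    conn = build_files_db()
    print(children_of(conn, "/Library/Fonts/"))   # two .ttf files
    print(children_of(conn, "/Library/Fonts"))    # [] because the '/' is missing
    print(descendants_of(conn, "/Library/Fonts/"))
```

Because paths are compared as strings, an equality test on `parentpath` finds direct children only, while a `LIKE 'prefix%'` test recurses into subdirectories, which is the pattern the README's own recursive examples use.
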

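Data usage statement:

I. Data source and display:

1. The data comes from internet data collection or is supplied by service providers; this platform lets users view and browse the datasets.
2. This platform only displays basic information about the datasets, including but not limited to file types such as image, text, video, and audio.
3. Basic dataset information comes from the original data address or from the data provider. If the dataset description differs, the original data address or the provider's original address prevails.

II. Ownership:

1. Copyright in all datasets on this site belongs to the original data publisher or data provider.

III. Reposting:

1. If you need to repost data from this site, retain the original data address and the relevant copyright notice.

IV. Infringement and handling:

1. If any data shown on this site involves infringement, contact the site promptly and we will arrange to take the data offline.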
