I work with UK company information on a daily basis, and I thought it would be useful to publish a list of all active companies, in a way that could be used for machine learning.
There are 3,838,469 rows in the dataset, one for each active company. Each row has the company name, the date of incorporation and the Standard Industrial Classification (SIC) code.
The company list is from the publicly available 1st November 2017 Companies House snapshot.
The SIC code descriptions are from the gov.uk website.
In the file AllCompanies.csv each row is formatted as follows:
- CompanyName - Alphanumeric company name
- IncorporationDate - in British date format, dd/mm/yyyy
- SIC - 5-digit code, or None if not known; see the separate file for a description of each code.
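As a rough illustration of the format above, here is a minimal sketch of reading the rows with Python's standard library. It assumes the file is comma-delimited and starts with a header row using the three column names listed; neither is confirmed here, so check the actual file first. The names `Company` and `read_companies` are made up for this example, and the sample rows are invented.

```python
import csv
import io
from datetime import date, datetime
from typing import Iterator, NamedTuple, Optional, TextIO


class Company(NamedTuple):
    name: str
    incorporated: date
    sic: Optional[str]  # kept as text so codes with a leading zero survive


def read_companies(handle: TextIO) -> Iterator[Company]:
    """Yield one Company per CSV row (assumes a header row is present)."""
    for row in csv.DictReader(handle):
        sic = row["SIC"].strip()
        yield Company(
            name=row["CompanyName"].strip(),
            # British date format, dd/mm/yyyy
            incorporated=datetime.strptime(
                row["IncorporationDate"].strip(), "%d/%m/%Y"
            ).date(),
            # The literal text "None" marks an unknown SIC code
            sic=None if sic in ("", "None") else sic,
        )


# Invented sample rows standing in for AllCompanies.csv
sample = io.StringIO(
    "CompanyName,IncorporationDate,SIC\n"
    "EXAMPLE WIDGETS LTD,03/11/2015,62020\n"
    "ANOTHER EXAMPLE LIMITED,21/01/1998,None\n"
)
companies = list(read_companies(sample))
```

For the real file, something like `open("AllCompanies.csv", newline="", encoding="utf-8")` could be passed in place of `sample`; the encoding is also an assumption.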
One possible use for this data is to use ML to suggest a new, unique but suitable name for a company, based on what other companies with the same SIC code are called.
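A simple first step towards that idea, well short of actual ML, might be to count which words appear most often in company names for each SIC code. The sketch below is only an illustration: `common_words_by_sic` is a hypothetical helper, the sample rows are invented, and splitting names on anything that is not a letter or digit is an assumed tokenisation.

```python
import re
from collections import Counter, defaultdict


def common_words_by_sic(rows, top_n=3):
    """Return the top_n most frequent name words for each SIC code.

    rows is an iterable of (company_name, sic_code) pairs.
    """
    counts = defaultdict(Counter)
    for name, sic in rows:
        # Upper-case the name and keep runs of letters and digits as words
        counts[sic].update(re.findall(r"[A-Z0-9]+", name.upper()))
    return {
        sic: [word for word, _ in counter.most_common(top_n)]
        for sic, counter in counts.items()
    }


# Invented sample rows
sample_rows = [
    ("BRIGHT SOFTWARE LTD", "62012"),
    ("NORTH SOFTWARE LTD", "62012"),
    ("GREEN BAKERY LTD", "10710"),
]
top_words = common_words_by_sic(sample_rows, top_n=2)
```

Words such as LTD and LIMITED are likely to dominate every code, so filtering them out first would probably be needed before the counts say anything useful.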
Another might be to analyse how company names have evolved over time.
ML could perhaps also be used to determine what a typical company name looks like, or to analyse whether company names have become longer or more complicated over time.
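The question of whether names have become longer could be explored, as a starting point, by averaging name length per year of incorporation. This is a minimal sketch using invented rows; `mean_name_length_by_year` is a hypothetical helper, and character count is only one assumed measure of "longer".

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def mean_name_length_by_year(rows):
    """Map incorporation year to the mean company-name length in characters.

    rows is an iterable of (company_name, incorporation_date) pairs,
    with dates in dd/mm/yyyy format.
    """
    lengths = defaultdict(list)
    for name, incorporation_date in rows:
        year = datetime.strptime(incorporation_date, "%d/%m/%Y").year
        lengths[year].append(len(name))
    return {year: mean(values) for year, values in sorted(lengths.items())}


# Invented sample rows
sample_rows = [
    ("ACME LTD", "01/02/1990"),
    ("ACME TRADING LTD", "15/06/1990"),
    ("ACME DIGITAL SOLUTIONS LIMITED", "30/09/2016"),
]
trend = mean_name_length_by_year(sample_rows)
```

Because the dataset only contains companies that were still active at the snapshot date, any trend found this way describes surviving companies rather than every company incorporated in a given year.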
I am sure there are many more possible uses for this data, in ways that I cannot imagine.
This is my second go (the first was published a few hours ago) at publishing a dataset on any medium, so any useful tips and hints would be extremely welcome.
Links to the raw data sources are here:
- Companies House http://download.companieshouse.gov.uk/en_output.html
- SIC Codes https://www.gov.uk/government/publications/standard-industrial-classification-of-economic-activities-sic
