I work with UK company information on a daily basis, and I thought it would be useful to publish a list of all active companies, in a way that could be used for machine learning.
There are 3,801,733 rows in the dataset, one for each active company. The postcode included in the dataset has been geolocated, and the resulting latitude and longitude are included, along with the Standard Industrial Classification (SIC) code and the date of incorporation.
The company list is from the publicly available 1st November 2017 Companies House snapshot.
The postcode geolocations and SIC Codes are from the gov.uk website.
In the file AllCompanies.csv each row is formatted as follows:
- CompanyNumber - in the format of 99999999 for England/Wales, SC999999 for Scotland and NI999999 for Northern Ireland.
- IncorporationDate - in British date format, dd/mm/yyyy
- RegisteredAddressPostCode - standard British format Postcode
- Latitude - to 6 decimal places
- Longitude - to 6 decimal places
- SIC - 5 digits, or None if not known - see separate file for a description of each code.
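The column layout above can be read with Python's standard library. The following is a minimal sketch, not a definitive loader: it assumes the columns appear in the order listed, that the file may or may not start with a header row, and that every row has a latitude and longitude. The two sample rows are made up for illustration and are not taken from the dataset. Company numbers and SIC codes are kept as strings so that leading zeros survive.

```python
import csv
import io
from datetime import datetime

# Column order as described in the list above (assumed to match the file).
FIELDS = ["CompanyNumber", "IncorporationDate", "RegisteredAddressPostCode",
          "Latitude", "Longitude", "SIC"]


def parse_row(row):
    """Convert one raw CSV row (a dict of strings) into typed values."""
    sic = row["SIC"].strip()
    return {
        "CompanyNumber": row["CompanyNumber"].strip(),
        # British date format, dd/mm/yyyy
        "IncorporationDate": datetime.strptime(
            row["IncorporationDate"].strip(), "%d/%m/%Y").date(),
        "RegisteredAddressPostCode": row["RegisteredAddressPostCode"].strip(),
        "Latitude": float(row["Latitude"]),
        "Longitude": float(row["Longitude"]),
        # The literal text "None" marks an unknown SIC code.
        "SIC": None if sic in ("", "None") else sic,
    }


def read_companies(handle):
    """Yield parsed rows, skipping a header row if one is present."""
    for row in csv.DictReader(handle, fieldnames=FIELDS):
        if row["CompanyNumber"] == "CompanyNumber":
            continue
        yield parse_row(row)


# Made-up sample rows, for illustration only.
SAMPLE = """CompanyNumber,IncorporationDate,RegisteredAddressPostCode,Latitude,Longitude,SIC
01234567,03/02/2001,SW1A 1AA,51.501009,-0.141588,62012
SC123456,15/11/2015,EH1 1YZ,55.953252,-3.188267,None
"""

companies = list(read_companies(io.StringIO(SAMPLE)))
```

To read the real file, replace the `io.StringIO(SAMPLE)` handle with `open("AllCompanies.csv", newline="")`.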
One possible use for this data is to see where certain types of companies are located in the UK, and how they multiply and spread across the country over time.
Another is training ML algorithms to predict where the density of certain types of companies is high (or low). That could suggest a good area for a company that wants minimal competition or, conversely, identify high-density clusters where it might be easier to recruit specialised staff.
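As a starting point for this kind of analysis, companies with a given SIC code can be counted per map grid cell and per year of incorporation. This is only a rough sketch: the function names, the 0.1 degree cell size, and the sample rows are all illustrative assumptions, not part of the dataset. Cells measured in degrees do not have equal area, and because the dataset holds only active companies, the yearly counts reflect companies that still exist rather than all that were ever incorporated.

```python
import math
from collections import Counter
from datetime import datetime


def grid_cell(lat, lon, cell_deg=0.1):
    """Map a coordinate to an integer (row, column) cell of cell_deg degrees."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def counts_by_cell_and_year(rows, sic, cell_deg=0.1):
    """Count companies with the given SIC code per (grid cell, year)."""
    counts = Counter()
    for date_text, lat, lon, code in rows:
        if code != sic:
            continue
        year = datetime.strptime(date_text, "%d/%m/%Y").year
        counts[(grid_cell(lat, lon, cell_deg), year)] += 1
    return counts


# Made-up rows: (IncorporationDate, Latitude, Longitude, SIC)
rows = [
    ("03/02/2001", 51.501009, -0.141588, "62012"),
    ("20/06/2015", 51.515000, -0.120000, "62012"),
    ("15/11/2015", 51.507000, -0.128000, "62012"),
    ("01/01/2015", 55.953252, -3.188267, "56101"),
]

counts = counts_by_cell_and_year(rows, "62012")
for (cell, year), n in sorted(counts.items()):
    print(cell, year, n)
```

The resulting counts could serve as features or targets for a density model, with a finer or coarser `cell_deg` depending on how local the question is.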
A useful addition would be to overlay population density, which I am currently working on as an option for this dataset.
I am sure there are many more possible uses for this data that I cannot imagine.
This is my first go at publishing a dataset on any medium, so any tips and hints would be extremely welcome.
Links to the raw data sources are here:
- Companies House
- Postcode to Geolocation
- SIC Codes
