The file contains company data. Each row is a company in your company database. As you are aware, there are a lot of duplicated companies, so under the old algorithm your database marks them as “Invalid” in Column J, “Flag”. The new algorithm lists its flags in Column K. When it says “Valid”, that’s a company the algorithm has determined to be a good, real, non-duplicated company that should be kept in the database.
There is additional data in the file for each company to help you evaluate it.
Some issues:
1) Some companies have many legitimate subsidiaries. For example, Google and YouTube might appear as 2 companies, but YouTube is a subsidiary of Google. You have decided that these should stay in your database as 2 separate companies if all 3 of these conditions are met:
a) the subsidiary is large, with more than $100M in revenue,
b) the subsidiary’s name looks substantially different from the parent’s, and
c) the subsidiary still exists as a distinct identity, because sometimes the parent company simply absorbs the subsidiary and the subsidiary disappears, i.e. its website no longer exists.
In the Google / YouTube example, all three conditions are met, so both Google and YouTube are kept as different companies in your database.
2) Many big companies have hundreds of subsidiaries that are all pretty much the same company. For example, Citibank can have subsidiaries like Citibank Auto Loans, Citibank New York, and Citibank Florida. These typically look like the same company to most consumers, so you do not want to keep all of those subsidiaries, only the main parent company.
3) Sometimes the database has multiple listings that are exactly the same company, i.e. the company name is the same and the URL is the same. In those cases, you want to keep the listing that has the most information (e.g. revenue, employee count), the highest revenue, etc., and remove the ones with less.
4) There is often incorrect information, so of course you want to keep the database listing with the most accurate information.
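
To make issues 1 to 3 concrete, here is a minimal Python sketch of how those rules could be checked. It is not the algorithm behind Column K. The `Listing` fields, the `website_live` flag (used as a stand-in for "the identity still exists"), the 0.6 name-similarity cutoff, and the use of `difflib` to judge "substantially different" are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

REVENUE_THRESHOLD_USD = 100_000_000  # issue 1a: more than $100M revenue
NAME_SIMILARITY_CUTOFF = 0.6         # assumed cutoff, not from the post


@dataclass
class Listing:
    # Hypothetical record layout; the real file's columns may differ.
    name: str
    url: str
    revenue_usd: Optional[float] = None
    employees: Optional[int] = None
    website_live: bool = True  # assumed proxy for "the identity still exists"


def names_differ(subsidiary_name: str, parent_name: str) -> bool:
    """Issue 1b: names differ unless the subsidiary's name contains the
    parent's name or closely resembles it."""
    sub = subsidiary_name.strip().lower()
    parent = parent_name.strip().lower()
    if parent in sub:  # e.g. "citibank auto loans" contains "citibank"
        return False
    return SequenceMatcher(None, sub, parent).ratio() < NAME_SIMILARITY_CUTOFF


def keep_subsidiary(subsidiary: Listing, parent: Listing) -> bool:
    """Issue 1: keep the subsidiary as its own company only if all three
    conditions hold. Same-brand subsidiaries (issue 2) fail the name check."""
    return (
        (subsidiary.revenue_usd or 0) > REVENUE_THRESHOLD_USD
        and names_differ(subsidiary.name, parent.name)
        and subsidiary.website_live
    )


def completeness(listing: Listing) -> int:
    """Count how many of the optional fields are filled in."""
    return sum(value is not None for value in (listing.revenue_usd, listing.employees))


def pick_best_listings(listings):
    """Issue 3: among listings with the same name and URL, keep the fullest one."""
    groups = defaultdict(list)
    for listing in listings:
        key = (listing.name.strip().lower(), listing.url.strip().lower())
        groups[key].append(listing)
    # Ties on completeness are broken by the higher revenue.
    return [
        max(group, key=lambda item: (completeness(item), item.revenue_usd or 0))
        for group in groups.values()
    ]
```

Under these assumptions, a listing named "Citibank Auto Loans" is never kept apart from "Citibank", because its name contains the parent's name, while a subsidiary with a clearly different name still has to pass the revenue and website checks. Issue 4 (accuracy) is not covered here, since it needs an outside source to compare against.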