Analysis

Background

Glassdoor collects a variety of salary and employee review information on a diverse set of companies. Employees use the site to look up salaries and reviews of companies they might be interested in joining, and can in turn provide Glassdoor with reviews of their current company.

By collecting and analyzing this information, we are able to make broad inferences about the health and general direction of a company.

Technical Details

We set up a few PHP web spiders to crawl Glassdoor’s website, which was relatively easy to do since the site is still served over HTTP rather than HTTPS. If Glassdoor upgrades to a secure site in the future, we will have to find a good way to link login credentials with our spiders, since they will most likely limit the amount of data we are able to scrape. The spiders search daily for new reviews of the companies we are monitoring. The reviews are formatted as JSON files and ingested into one of our Amazon Web Services (EC2) servers running Apache Spark on top of a single Hadoop node. Our analytics run over the data in Python using the PySpark API, and the results are then loaded into a simple Redis server for querying.
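
A minimal sketch of the JSON-to-query-store step, written in plain Python so it runs without a cluster. The record fields (`company`, `date`, `rating`), the "2 stars or fewer is negative" threshold, and the key format are all hypothetical; they are not Glassdoor’s schema or a description of the actual job. On Spark, the same filter-and-count shape would presumably be expressed with RDD operations instead of a loop.

```python
import json
from collections import Counter

# Hypothetical review records, one JSON object per line (assumed layout).
SAMPLE_LINES = [
    '{"company": "ACME", "date": "2015-01-05", "rating": 2}',
    '{"company": "ACME", "date": "2015-01-05", "rating": 4}',
    '{"company": "ACME", "date": "2015-01-06", "rating": 1}',
    '{"company": "Globex", "date": "2015-01-05", "rating": 5}',
]


def parse_review(line):
    """Turn one JSON line into a (company, date, rating) tuple."""
    record = json.loads(line)
    return record["company"], record["date"], int(record["rating"])


def daily_negative_counts(lines, negative_max=2):
    """Count reviews rated at or below `negative_max`, per company and day."""
    counts = Counter()
    for line in lines:
        company, date, rating = parse_review(line)
        if rating <= negative_max:
            counts[(company, date)] += 1
    return dict(counts)


def redis_key(company, date):
    """Build the (assumed) key under which a daily count would be stored."""
    return "neg_reviews:{}:{}".format(company, date)


counts = daily_negative_counts(SAMPLE_LINES)
for (company, date), n in sorted(counts.items()):
    print(redis_key(company, date), n)
```

Keying the results by company and day keeps each lookup a single read, which suits a simple key-value store.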

Getting Python, Spark and Hadoop to work together was a hassle even with the ready-made spark-ec2 script that Spark provides, but we have been very pleased with the performance and ease of use now that everything is up and running. Currently, we are doing trending analytics, but we hope to expand into more complicated machine learning and graph processing algorithms, both of which Spark should handle very well once we install and start using the MLlib (machine learning) and GraphX (graph processing) packages.


Initially, we tried to find about 10 companies that had either (1) large changes in revenue or (2) management gaffes or success stories in 2012 or 2013. We manually scraped all user data for these companies for the two years leading up to the change. After proving out a few algorithms with some rudimentary formulas in Excel, we decided that the project had potential and began doing live ingest for a select few companies.

After our small proof of concept, we began scaling up the number of companies for which we ingest data. As of January 2015, we are collecting data on 78 companies in the S&P 500, as well as 14 additional companies that we are closely following on MVS. We are monitoring the data ingest rates into our Amazon Web Services EC2 compute cluster. Our main concern is that if we begin collecting and analyzing data on too many companies too soon, our cluster will get swamped, and we will have to pay too much for VM capacity as Glassdoor’s user base slowly increases (Spark can be memory intensive).

We try to weight the opinions of employees differently in our analytics based on their rank in the company, salary, length of stay, experience, and so on, but it is difficult to create models with a ‘one size fits all’ approach for all the companies we monitor. We have begun running separate analytics for different business types that we think might work, but we may have to move on to more complex clustering approaches if our current process becomes too labor intensive.
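
To illustrate what weighting opinions by rank and length of stay could look like, here is a small sketch of a weighted average rating. The weight formula, the `seniority_level` and `tenure_years` fields, and the default parameters are assumptions made up for this example; the actual models are not described here.

```python
def review_weight(seniority_level, tenure_years,
                  seniority_scale=0.5, tenure_cap=5.0):
    """Hypothetical weight: grows with seniority, and with tenure up to a cap."""
    tenure_factor = min(tenure_years, tenure_cap) / tenure_cap
    return 1.0 + seniority_scale * seniority_level + tenure_factor


def weighted_sentiment(reviews):
    """Weighted mean rating; returns None when there are no reviews."""
    total_weight = 0.0
    weighted_sum = 0.0
    for review in reviews:
        weight = review_weight(review["seniority_level"],
                               review["tenure_years"])
        weighted_sum += weight * review["rating"]
        total_weight += weight
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


# Illustrative input only: a long-tenured manager and a new junior hire.
reviews = [
    {"rating": 2, "seniority_level": 2, "tenure_years": 5},
    {"rating": 4, "seniority_level": 0, "tenure_years": 0},
]
print(weighted_sentiment(reviews))  # weights 3.0 and 1.0 give 2.5
```

Keeping the weight function separate from the averaging makes it straightforward to swap in a different formula per business type, which is where a single model for every company tends to break down.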

Conclusions We’ve Reached

We have found that an increase in the frequency of negative reviews of a company correlates with a flight of talent, which can be a strong indication of underlying problems that wouldn’t be readily apparent if you were solely speaking with the company’s customers or analyzing the company’s 10-Ks.
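
One simple way to quantify "an increase in frequency of negative reviews" is to compare the share of negative ratings in a recent window against the window before it. The sketch below does that; the window length, the negative threshold, and the sample ratings are assumptions chosen for illustration, not the trending analytics actually in use.

```python
def negative_share(ratings, negative_max=2):
    """Fraction of ratings at or below `negative_max` (0.0 for no ratings)."""
    if not ratings:
        return 0.0
    negatives = sum(1 for r in ratings if r <= negative_max)
    return negatives / len(ratings)


def negative_trend(ratings_by_period, window=3):
    """Change in negative share: last `window` periods minus the `window` before."""
    if len(ratings_by_period) < 2 * window:
        raise ValueError("need at least 2 * window periods of ratings")
    recent = [r for period in ratings_by_period[-window:] for r in period]
    previous = [r for period in ratings_by_period[-2 * window:-window]
                for r in period]
    return negative_share(recent) - negative_share(previous)


# Illustrative star ratings, one inner list per period (e.g. per month).
periods = [[4, 5], [3, 4], [5, 2], [2, 1], [1, 3], [2, 2]]
trend = negative_trend(periods)
print(round(trend, 3))  # 5/6 - 1/6, about 0.667
```

A positive value means negative reviews make up a larger share of recent activity; using a share rather than a raw count avoids flagging a company merely because its review volume grew.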

After analyzing a few “people-centric” companies (e.g. consulting firms), we have found that they tend to show the strongest correlation between company performance and the happiness of their low-level employees (the day-to-day consultants who earn the company the bulk of its income). However, the performance of companies that mainly sell a product doesn’t seem to be affected as much by the opinions of low-level employees, and is instead heavily influenced by management sentiment.
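
The correlations discussed here could be measured in several ways; a plain Pearson coefficient between a sentiment series and a performance series is one common choice. The sketch below assumes that choice, and its input numbers are invented purely to exercise the function.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up quarterly values: average low-level employee rating vs. revenue growth.
sentiment = [3.8, 3.5, 3.1, 2.7]
growth = [0.04, 0.03, 0.01, -0.02]
print(round(pearson(sentiment, growth), 3))
```

Computing the coefficient separately for low-level and management reviews would give two numbers per company that can be compared across business types.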

One thing we have found interesting over the last few years is the success and outlook of IBM. The company had increasing revenue and profits, yet we were seeing massive increases in negative reviews regarding decreasing benefits and “clueless” management in its consulting segment. We ended up removing IBM from our Portfolio in 2012, but were hesitant to take any action with real money, since the health of the consulting business seemed separate from, and thus might not reflect, the overall health of the company. In retrospect, that hesitation was a mistake. We do, however, feel that this tool will be a great way to evaluate other consulting companies moving forward.


We have been pleasantly surprised by our ability to predict whether a company is heading downhill by analyzing changes in its employees’ sentiment. Unfortunately, we have had little luck predicting changes in a positive direction. We believe this could be due to simple selection bias: employees who are happy with their current position and their company’s management will generally be less inclined to look for a new position elsewhere. We haven’t done enough analysis yet, however, to present a strong case in favor of this theory.

We are having a hard time matching position titles with equivalent positions at other companies. Certain industries, such as finance and marketing, seem to have a variety of job titles for the same role. In software companies, by contrast, it seems like everyone has the title “software developer,” which does not give a good indication of the employee’s level of control at the company.
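
A first pass at the title-matching problem could be a keyword lookup that maps free-text titles onto a coarse seniority level, as sketched below. The keyword lists and the 0 to 3 scale are hypothetical, and the sketch also shows the limitation described above: a bare “software developer” falls through to the default level and tells us nothing.

```python
import re

# Hypothetical keyword table, checked from most to least senior.
LEVEL_KEYWORDS = [
    (3, ("chief", "vp", "vice president", "director")),
    (2, ("manager", "lead", "principal", "head")),
    (1, ("senior", "sr")),
]


def title_level(title):
    """Map a free-text job title to a coarse seniority level (0 = unknown)."""
    # Lowercase, replace punctuation with spaces, and collapse whitespace.
    cleaned = re.sub(r"[^a-z ]", " ", title.lower())
    padded = " " + " ".join(cleaned.split()) + " "
    for level, keywords in LEVEL_KEYWORDS:
        for keyword in keywords:
            # Padding with spaces matches whole words and phrases only.
            if " " + keyword + " " in padded:
                return level
    return 0


for title in ("Vice President, Marketing", "Marketing Manager",
              "Sr. Software Developer", "Software Developer"):
    print(title, "->", title_level(title))
```

Matching on whole words avoids false hits such as “leader” matching “lead”, but any fixed keyword table will need per-industry tuning.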

© 2016 MVS Financial. All rights reserved.