There’s no shortage of news pieces from prominent media outlets reporting false information. Either the pieces themselves are fabricated, the data they use is skewed or misreported, or the data is simply made up. So how do we really know what is true and what is made up? There’s probably a model in psychology and/or economics describing how circularly destructive false information is to both the subject and the reporter (and the outlet that published it); not to mention how social media exacerbates the spread of false information!
I recently read an interesting use case written by data.world promoting their data set hosting platform (http://www.kdnuggets.com/2016/09/behind-dream-data-work.html). The case itself was decent, BUT it reminded me of the saying "Garbage In, Garbage Out": the quality of your findings is only as good as the quality of the underlying data. Makes sense. But it really made me think about how the data we use is actually collected.
I appreciate websites like data.world, and I hope they stick around, because practicing data analysis on real data sets is important, especially for beginners. One feature on data.world is particularly interesting: verified sources. The source of a data set (the person who posted that brand new .csv file full of information) shows up as verified if they are a verified member (e.g., a professor at the University of Washington, or a researcher at the Centers for Disease Control). Think of it like Twitter’s verified user tag, but useful. These users also document the data collection methodology, engage in helpful conversation, and answer questions other users may have about how the data was collected. I just hope that other practitioners don’t rely solely on this information to make decisions.
Prediction alert: I predict that other users will rely on this information to make decisions. Furthermore, these users will increasingly make wrong decisions based on poor data.
This isn’t surprising. Unless you capture the data yourself, how can you really know if it’s any good? The Catch-22 is that there’s no practical way to capture all of the data yourself, so you rely on third-party information to make decisions. See how this cycle gets out of hand, compounding the problems at every step?
Perhaps going forward we should all just use the disclaimer: "This report/decision was based on findings derived from this specific set of data, and may not apply to new sets of data", with requisite links and sources.
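To make that disclaimer idea concrete, here is a minimal sketch of how a report could attach one automatically. Everything here is illustrative: the function name, the field names, and the example data set are all hypothetical, not part of any real standard or the data.world API.

```python
# Hypothetical sketch: tying a report's conclusions to the one data set
# they were derived from. All names and fields below are made up for
# illustration; they do not reference a real data set or platform API.

def provenance_disclaimer(dataset_name, source_url, collected_by, collected_on):
    """Build a disclaimer string linking findings to a specific data set."""
    return (
        f"This report was based on findings from the data set '{dataset_name}' "
        f"({source_url}), collected by {collected_by} on {collected_on}, "
        "and may not apply to other data sets."
    )

note = provenance_disclaimer(
    dataset_name="flu-cases-2016",
    source_url="https://data.world/example/flu-cases-2016",
    collected_by="Example Researcher",
    collected_on="2016-09-01",
)
print(note)
```

The point isn’t the code itself; it’s that the links and sources travel with the conclusion, so a reader can check the underlying data before reusing the result.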
Personally, I prefer to think in broad strokes: if the data generally matches the consensus, then the result/report/decision should be framed in equally general terms. Disclaimers matter here, too.