The following text is an extract from Designing For Data Scientist published by Creme Global recently. This contents summarises research carried out by IBM in their paper The Risks of Using Spreadsheets for Statistical Analysis.
Spreadsheets are a ubiquitous tool in data science. Spreadsheets are not only used for data manipulation, they are also used for statistical analysis and even running basic models. A team might have a spreadsheet sheet with macros that run an analysis on data. The user can be anyone from a Data Scientist to an analyst working in marketing , finance or any industry that models data. The common activity of forecasting is essentially a type of modelling.
Users stretch the limits of spreadsheets to the point where they are used to perform tasks beyond their ability to perform correctly. IBM carried out useful research on this topic. Their findings are summarised in their unambiguously titled paper The Risks of Using Spreadsheets for Statistical Analysis. Here are some of the insights:
If we look at the type of errors spreadsheet users encounter we see more clearly the advantages of moving to a formal model
Functional Errors: These are errors that break the spreadsheet and give an error message.
Outlier Errors: The spreadsheet still works but an expert spots that these numbers can’t be correct.
Stealth Errors: These errors in results even pass the attention of experts. Often in statistics no one know what correct results should look like. Errors like these can go undiscovered.
“IBM studies reveal that 90 percent of all spreadsheets contain at least one error.”
Some of the primary causes include:
Logic: Relating cells incorrectly might mean a logic error. Using the wrong function or operator are logic errors.
Copying formulas: Copying a formula might fall foul of unseen cell relationships. Hand typing a formula is also dangerous.
Copying numbers on top of formulas: Copy a constant number over an invisible formula in a cell and the formula is wiped out along with other cell relationships.
Using the wrong function: Some spreadsheet functions have dangerously similar names.
Leaving out data: Most spreadsheets setups will not give validation error if you have not completed full rows of data and have left some cells out.
Adding incorrect data: Incorrect data entered can often pass unnoticed. Ideally the entry would cause an output effect that is very noticeable – for example accidentally entering a date rather than a number – however this is not always the case.
Excel patches: Tragically patches have caused errors to existing sheets as shown by this report.
Accuracy: Spreadsheet have been shown to have accuracy issues when executing complex maths, even when error free. Excel has gotten criticism for its shortcomings in statistical analysis.
Complexity: Spreadsheet analyses are often made up of chains of calculations. It is very easy to alter links in these chains without realizing and “breaking” the analysis.
“Professional statisticians continue to write books with titles like ‘Statistics with Excel,’ but they now warn students not to bet their jobs on Excel’s accuracy.”
Bruce D. McCullough, “The Unreliability of Excel’s Statistical Procedures,” Foresight, (February 2006) Vol. 3, 44-45.
Handling missing data values: A user here might try to enter a zero value which would upset the findings of a median value. If they enter a string it might be misinterpreted by an equation as a zero or something else entirely. One would at least need to come up with a standard for denoting missing values and stick to it. Mark Lambe here at Creme Global pointed out a related danger: “Excel’s default is to try to handle / interpret ambiguities rather than throw an error. This is a strength for simple stuff but becomes a disadvantage as complexity increases”
Categorical data with hidden meaning: Imagine you enter the value -1 or -2 in a cell to denote two different types of error. This categorical data needs to be documented somewhere. What if the author of the spreadsheet moves on?
Labour caused by focusing on cells: Spreadsheets are cell centric and modifying a cell can require the changing of formulas or data across a range of cells/rows/columns.
New data and equations: Equations can be set to extend automatically for new data, but sometimes this is not desired. Either way this will be a consideration and possible source of new errors.
Using spreadsheets for serious statistical work can be more trouble than it’s worth, and sometimes the user never even finds out about it, but reaps the negative consequences.
Furthermore in Excel the Data Scientist loses the reproducibility coding gives by way of versioning and pull requests.
For Designers working in data science, be aware of the ubiquity and consequences of spreadsheets and be able to bring this knowledge into your design workshops.