The term “big data” refers to datasets that are too large to be handled by conventional information-management and analysis tools. Big data algorithms themselves emerged with the introduction of the first high-performance servers (mainframes), which had enough resources for operational processing and were suited to computation and further analysis. Entrepreneurs and scientists are concerned with the quality of data interpretation, the development of tools for working with data, and the advancement of storage technologies. This has been helped along by the introduction and active use of cloud storage and computing models.
Data volumes grew steadily throughout history up to 2008, and growth has continued exponentially since then. Today, many companies closely monitor the development of IT technologies.
When working with big data, mistakes cannot be avoided. You need to get to the bottom of the data, set priorities, optimize and visualize the data, and extract the right insights. According to surveys, 85% of companies are striving to manage their data, but only 37% report success in this area. In practice, studying negative experience is difficult because no one likes to talk about failures. Analysts will gladly talk about successes, but as soon as the conversation turns to mistakes, be prepared to hear about “noise accumulation”, “spurious correlation” and “random endogeneity”, without any specifics. Do these problems really exist, for the most part, only at the level of theory?
Sampling errors
In the article “Big data: A big mistake?”, the author recalled an interesting story about the startup Street Bump. The company invited Boston residents to monitor the condition of the road surface using a mobile application. The software recorded the position of the smartphone and deviations from the norm: pits, bumps, potholes, and so on. The collected data was sent in real time to the appropriate recipients in the municipal services.
However, at some point the mayor’s office noticed that far more complaints were coming from wealthy neighborhoods than from poor ones. An analysis of the situation showed that wealthy residents had phones with a permanent Internet connection, drove more often, and were active users of various applications, including Street Bump.
As a result, the actual object of study was an event in the application, whereas the statistically meaningful unit of interest should have been the person using the mobile device. Given the demographics of smartphone users at the time (mostly white Americans with middle and high incomes), it became clear how unreliable the data turned out to be.
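To make the unit-of-analysis problem concrete, here is a minimal sketch (not the Street Bump data; all numbers are invented for illustration) showing how raw event counts can end up tracking app adoption rather than actual road damage:

```python
# Toy simulation: reports are generated only when a driver who hit a pothole
# also happens to run the reporting app, so app coverage drives the counts.
import random

random.seed(0)

districts = {
    # name: (toy pothole probability per trip, share of drivers running the app)
    "wealthy": (0.05, 0.60),
    "poor":    (0.09, 0.15),
}

def simulated_reports(pothole_prob, app_share, trips=10_000):
    """Count logged reports: a pothole is recorded only if the driver has the app."""
    reports = 0
    for _ in range(trips):
        hit_pothole = random.random() < pothole_prob
        has_app = random.random() < app_share
        if hit_pothole and has_app:
            reports += 1
    return reports

for name, (prob, share) in districts.items():
    print(name, simulated_reports(prob, share))
# Despite worse roads, the "poor" district produces fewer reports, because the
# sampled unit is really "trip by an app user", not "road segment".
```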
The problem of unintentional bias has wandered from one study to another for decades: there will always be people who use social networks, apps, or hashtags more actively than others. The data itself is not enough; its quality is of paramount importance. Just as the wording of a questionnaire influences survey results, the electronic platforms used to collect data distort research results by influencing the behavior of the people who work with them.
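The article does not prescribe a fix, but one standard mitigation, assuming you can estimate each group's participation rate, is to reweight the observed counts by the inverse of that rate (a simple form of post-stratification). A hedged sketch, reusing the illustrative numbers from the simulation above:

```python
# Inverse-probability reweighting sketch; counts and coverage rates are
# illustrative assumptions, not real Street Bump figures.
raw_reports = {"wealthy": 300, "poor": 135}    # observed app events
app_share   = {"wealthy": 0.60, "poor": 0.15}  # estimated app coverage per group

adjusted = {d: raw_reports[d] / app_share[d] for d in raw_reports}
print(adjusted)  # {'wealthy': 500.0, 'poor': 900.0} -- the ranking reverses
```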
At DataScience UA, training is built around the tasks that will be assigned to the specialist. However, the tasks of a Data Scientist can differ depending on the company's field of activity. Here are some examples:
- anomaly detection – for example, unusual bank-card activity or fraud (see the sketch after this list);
- analysis and forecasting – performance indicators, the quality of advertising campaigns;
- scoring and grading systems – processing large amounts of data to support decisions, for example on granting a loan.
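As a small illustration of the anomaly-detection task mentioned above, here is a minimal sketch using scikit-learn's IsolationForest on synthetic card transactions. The features and thresholds are illustrative assumptions, not a production fraud model:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic features: [amount in USD, hour of day]
normal = np.column_stack([rng.normal(40, 15, 1000), rng.integers(8, 22, 1000)])
fraud = np.array([[950.0, 3], [1200.0, 4]])  # large night-time charges
X = np.vstack([normal, fraud])

# Fit an isolation forest and flag the most isolated points as anomalies.
model = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = model.predict(X)        # -1 marks anomalies, 1 marks normal points
print(X[labels == -1][:5])       # the injected fraud rows should appear here
```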