
Some call data the new oil. Others call it the new gold. Philosophers and economists may argue about the quality of the metaphor, but there is no doubt that organizing and analyzing data is a vital undertaking for any business seeking to deliver on the promise of data-driven decision-making.
And to do that, a solid data management strategy is essential. Encompassing data governance, data operations, data warehousing, data engineering, data analytics, data science, and more, data management, when done right, can provide businesses in every industry with a competitive advantage.
The good news is that many facets of data management are well understood and based on sound principles that have evolved over decades. They may not be easy to apply or simple to understand, but thanks to scientists and mathematicians, companies now have a range of logical frameworks for analyzing data and drawing conclusions. More importantly, we also have statistical models that draw error bars delineating the limits of our analysis.
But for all the good that comes from studying data science and the various disciplines that power it, we still sometimes end up scratching our heads. Businesses often bump up against the field’s limits. Some of the paradoxes relate to the practical challenges of collecting and organizing so much data. Others are philosophical, testing our ability to reason about abstract qualities. And then there are the privacy concerns raised by collecting so much data in the first place.
Here are some of the dark secrets that make data management such a challenge for many businesses.
Unstructured data is difficult to analyze
Much of the data stored in corporate archives doesn’t have much structure at all. A friend of mine wants to use an AI to search the text notes taken by his bank’s call center staff. Those notes may contain insights that could help improve the bank’s loans and services. Maybe. But the notes were taken by hundreds of different people with different ideas about what to record from a given call. Staff members also have different writing styles and abilities. Some wrote very little. Some wrote too much about a given call. The text itself doesn’t have much structure to begin with, and when you have a pile of text written by hundreds or thousands of employees over decades, whatever structure exists is even weaker.
Even structured data is often unstructured
Good data scientists and database administrators bring structure to databases by specifying the type and format of each field. Sometimes, in the name of even more structure, they limit the values in a given field to integers within certain ranges or to predefined choices. Even then, the people filling out the forms stored in the database find ways to introduce kinks and problems. Sometimes fields are left blank. Others insert a dash or the letters “na” when they think a question doesn’t apply. People even spell their names differently from year to year, day to day, or even line to line on the same form. Good developers can catch some of these issues through validation. Good data scientists can also reduce some of this uncertainty through cleaning. But it’s still infuriating that even the most structured tables have questionable entries, and that those questionable entries can introduce unknowns and even parsing errors.
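To make the problem concrete, here is a minimal sketch, in Python with pandas, of the kind of cleanup such a table demands. The column names and placeholder values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table with the kinds of entries described above:
# blanks, dashes, "na" placeholders, and inconsistent name spellings.
raw = pd.DataFrame({
    "name":   ["Ann Smith", "ann smith ", "A. Smith", ""],
    "income": ["52000", "-", "na", "61,000"],
})

# Treat the common "not applicable" placeholders as true missing values.
cleaned = raw.replace({"": np.nan, "-": np.nan, "na": np.nan})

# Coerce the numeric field; anything unparseable becomes NaN instead of
# blowing up the analysis later.
cleaned["income"] = pd.to_numeric(
    cleaned["income"].str.replace(",", "", regex=False), errors="coerce"
)

# Normalize names just enough to reveal likely duplicates.
cleaned["name_key"] = cleaned["name"].str.strip().str.lower()

print(cleaned)
```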
Data schemas are either too strict or too loose
No matter how hard data teams try to spell out schema constraints, the resulting rules for valid values in different data fields end up either too strict or too loose. If the data team adds hard constraints, users complain that their answers aren’t on the restricted list of acceptable values. If the schema is too accommodating, users add strange values with little consistency. It is almost impossible to get the schema just right.
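A small, hypothetical sketch of the trade-off in Python: the field name and the list of accepted values are made up, but the failure modes on each side are the familiar ones.

```python
# Invented field and accepted values, purely for illustration.
ALLOWED_INDUSTRIES = {"retail", "finance", "healthcare"}

def validate_strict(value: str) -> str:
    # Strict schema: rejects anything not on the approved list, including
    # legitimate answers the schema's authors never anticipated.
    if value not in ALLOWED_INDUSTRIES:
        raise ValueError(f"{value!r} is not an accepted industry")
    return value

def validate_loose(value: str) -> str:
    # Loose schema: accepts any non-empty string, including typos,
    # abbreviations, and placeholders that will haunt the analysis later.
    return value.strip() or "unknown"

validate_loose("Fin.")               # passes, but won't match "finance" in reports
validate_loose("n/a")                # passes: another flavor of missing
try:
    validate_strict("agriculture")   # a real industry the list forgot
except ValueError as err:
    print(err)                       # the strict schema rejects it
```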
Data laws are very strict
Privacy and data protection laws are strong and only getting stronger. Between GDPR, HIPAA, and a dozen other regulations, it can be difficult to assemble data at all, and even more dangerous to leave it lying around waiting for a hacker to break in. In many cases, it is easier to spend the money on lawyers than on programmers or data scientists. These headaches are why some companies simply get rid of their data as soon as they can.
Data cleaning costs are enormous
Many data scientists will confirm that 90% of the work is simply collecting the data, putting it into a consistent form, and dealing with the endless holes and errors. The person who owns the data will always say, “It’s all in a CSV and ready to use.” But they don’t mention the empty fields or the character-encoding errors. It’s easy to spend 10 times longer cleaning the data for a data science project than actually running the routine in R or Python that performs the statistical analysis.
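As a rough illustration, here is what loading one of those “ready to use” CSVs often looks like in Python with pandas. The file name, column name, and placeholder values are assumptions; the point is how much of the script is cleanup and how little is analysis.

```python
import pandas as pd

# Hypothetical "ready to use" file; the name and columns are placeholders.
df = pd.read_csv(
    "loans.csv",
    na_values=["", "na", "N/A", "-"],   # the empty fields nobody mentioned
    encoding="latin-1",                 # the character errors nobody mentioned
)

# The cleanup: coerce the numeric column, then drop rows that still lack it.
df["loan_amount"] = pd.to_numeric(df["loan_amount"], errors="coerce")
df = df.dropna(subset=["loan_amount"])

# The statistical analysis itself is often a single line at the end.
print(df["loan_amount"].describe())
```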
Users are increasingly suspicious of your data practices
End users and customers are increasingly suspicious of companies’ data management practices, and some AI algorithms and their uses are only amplifying the fear, leaving many deeply worried about what happens to the data capturing their every move. These fears fuel regulation and often land companies, and even well-meaning data scientists, in public relations trouble. Not only that, but people deliberately pollute data collection with false values or wrong answers. Sometimes half the job is dealing with malicious partners and customers.
Integrating external data can be fruitful or lead to disaster
It’s one thing for a company to take ownership of the data it collects itself. IT and the data scientists have control over that. But increasingly aggressive companies are figuring out how to integrate their in-house information with third-party data and the vast seas of personalized information floating around the Internet. Some tools openly promise to suck up data on every customer and build a personalized dossier on every purchase. Yes, they use the same vocabulary as the spy agencies pursuing terrorists, only to track your fast-food purchases and your credit score. Is it any wonder people worry and panic?
Regulators crack down on data use
No one knows where smart data analysis crosses the line, but once it does, regulators show up. In one recent example from Canada, the government investigated how a donut chain was tracking customers who also bought from competitors. A news release announced: “The investigation revealed that Tim Hortons’ contract with a U.S. third-party location services provider contained language so vague and permissive that it allegedly allowed the company to sell ‘anonymized’ location data for its own ends.” And why? To sell more donuts? Regulators are paying ever more attention to anything involving personal information.
Your data scheme might not be worth it
We imagine that a brilliant algorithm can make everything more efficient and more profitable. And sometimes such an algorithm is indeed possible, but the price may be too high. For example, consumers, and even businesses, are increasingly questioning the value of the targeted marketing that comes out of sophisticated data management systems. Some point out that we often see ads for things we’ve already purchased, because the ad trackers haven’t figured out that we’re no longer in the market. The same fate often awaits other clever stratagems. Sometimes a rigorous analysis identifies the worst-performing factory, but it doesn’t matter because the company signed a 30-year lease on the building. Businesses need to be prepared for the possibility that all that data science genius will produce an answer that isn’t acceptable.
In the end, data decisions are often just judgment calls
Numbers can offer a lot of precision, but how humans interpret them is often what matters. After all the data analysis and AI magic, most algorithms come down to deciding whether some value is above or below a threshold. Sometimes scientists want a p-value below 0.05. Sometimes a police officer is looking to ticket cars going 20% over the speed limit. These thresholds are often just arbitrary values. For all the science and math that can be applied to data, many “data-driven” processes contain more gray areas than we’d like to believe, leaving decisions to instinct despite all the resources a company may have poured into its data management practices.
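A toy sketch, with invented numbers, of how much rides on those arbitrary cutoffs:

```python
# Invented numbers showing how a hard cutoff turns nearly identical
# measurements into opposite decisions.
P_VALUE_CUTOFF = 0.05      # the conventional significance threshold
SPEED_MARGIN = 1.20        # ticket drivers going 20% over the limit

def significant(p_value: float) -> bool:
    return p_value < P_VALUE_CUTOFF

def gets_ticket(speed: float, limit: float) -> bool:
    return speed > limit * SPEED_MARGIN

print(significant(0.049))     # True:  "publishable"
print(significant(0.051))     # False: "no effect", on nearly the same data
print(gets_ticket(72.1, 60))  # True:  barely over the line
print(gets_ticket(71.9, 60))  # False: barely under it
```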
Data storage costs are skyrocketing
Yes, hard drives keep getting bigger and the price per terabyte keeps falling, but programmers are accumulating bits faster than prices can drop. Internet of Things (IoT) devices keep uploading data, and users expect to browse a rich collection of those bytes forever. Meanwhile, compliance officers and regulators keep demanding that more data be retained in case of future audits. It would be one thing if someone actually looked at some of it, but there are only so many hours in the day. The percentage of data that is ever accessed again keeps declining, yet the bill for storing it all keeps rising.