The importance of clean data is well established, in part because the consequences of dirty data are so severe. Dirty data (information that is inconsistent, incomplete, or inaccurate) leads to wasted resources, productivity declines, poor communication, and misspent ad dollars. Although data cleansing is essential, it has to be understood in the context of a broader data governance strategy if data quality is to be maintained long term.
What is data cleansing and why is it important?
The magnitude and unending nature of data cleansing are captured well by IBM’s application of the “80/20 rule” to data science: “80 percent of a data scientist’s valuable time is spent simply finding, cleansing, and organizing data, leaving only 20 percent to actually perform analysis.”
Most data scientists are hired by companies with grand plans to build algorithms and machine learning models that will inform (or, just maybe, transform) business strategy. In reality, data scientists often wind up confined to the repetitive tasks that come with cleaning and maintaining data quality so that at least some commercial value can be extracted from their employer’s datasets.
The importance of data cleaning comes from the outsized role data quality plays in a company’s ability to thrive in an increasingly data-driven world. Ideally, artificial intelligence will soon be sophisticated enough to handle the heavy lifting currently burdening data analysts and scientists. Until that happens, keeping the 2.5 quintillion bytes of data humans create each day clean will remain a time-consuming, mission-critical priority for most businesses.
Importance of data cleaning in analytics
Data-driven decision making is informed by analytics that depend entirely on clean and reliable data. However, most companies collect data across departments using different systems and different variable definitions. This makes it difficult to combine and analyze company data holistically. Businesses must have a plan for integrating data prior to collecting it, or grapple with the costly and time-consuming process of retroactively standardizing data for analysis.
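To make the integration problem concrete, here is a minimal Python sketch of what retroactive standardization can look like when two departments store the same customer under different field names. The department names, field names, and sample records are all hypothetical, chosen only for illustration; a real integration plan would cover far more fields and edge cases.

```python
# Hypothetical mapping from each department's field names to one shared schema.
FIELD_MAP = {
    "sales": {"cust_email": "email", "cust_name": "name"},
    "marketing": {"Email Address": "email", "Full Name": "name"},
}


def standardize(record, department):
    """Rename a record's fields to the shared schema and tidy string values."""
    mapping = FIELD_MAP[department]
    out = {}
    for field, value in record.items():
        key = mapping.get(field, field)  # unmapped fields keep their name
        out[key] = value.strip() if isinstance(value, str) else value
    # Email comparisons should not depend on letter case.
    if isinstance(out.get("email"), str):
        out["email"] = out["email"].lower()
    return out


# Hypothetical records describing the same customer in two systems.
sales_row = {"cust_email": "Ana@Example.com ", "cust_name": "Ana Ruiz"}
marketing_row = {"Email Address": "ana@example.com", "Full Name": "Ana Ruiz"}

combined = [
    standardize(sales_row, "sales"),
    standardize(marketing_row, "marketing"),
]
```

After standardization, both rows share the same keys and the same normalized email, so they can be compared or merged. Agreeing on the shared schema before collection would make this mapping step unnecessary.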
Estimates put the average cost of preventing a data duplicate at $1, correcting a duplicate at $10, and storing duplicate data at $100. Accepting from the beginning that more isn’t always better can help mitigate data storage and correction costs. Rather than capture everything, determine your business strategy first, then decide what data is required to monitor its success with analytics.
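As a rough illustration of how those per-record estimates scale, the sketch below applies them to a hypothetical batch of 1,000 potential duplicates. The batch size and the function name are assumptions made for the example; only the $1, $10, and $100 figures come from the estimates above.

```python
# Per-record cost estimates quoted in the text, in US dollars.
COST_PREVENT = 1
COST_CORRECT = 10
COST_STORE = 100


def duplicate_cost(prevented, corrected, stored):
    """Total cost for a mix of prevented, corrected, and stored duplicates."""
    return (
        prevented * COST_PREVENT
        + corrected * COST_CORRECT
        + stored * COST_STORE
    )


# Hypothetical scenario: 1,000 potential duplicates handled three ways.
all_prevented = duplicate_cost(1000, 0, 0)  # 1,000 dollars
all_corrected = duplicate_cost(0, 1000, 0)  # 10,000 dollars
all_stored = duplicate_cost(0, 0, 1000)     # 100,000 dollars
```

Under these estimates, leaving the duplicates in storage costs one hundred times as much as preventing them at the point of collection.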
Data cleansing steps and strategy
Any process for data cleansing should fit into a larger data governance strategy. At a company level, data governance involves identifying the people, processes and tools that will help with managing and enforcing rules or best practices to keep data assets secure, clean and useful.
Apply a governance model to data cleansing in three steps:
Determine who will own data cleanliness
In order to get your data clean, and keep it that way, you’ll need a team or individual (depending on the size of your company) that takes ownership of data quality. When outlining the responsibilities of the data quality team, remember that data cleansing isn’t solely the responsibility of data analysts or other employees in technical roles. Technical employees responsible for managing datasets may have to educate managers across departments about the value of sourcing high-quality data at the point of collection.
Formalize the data cleaning process
It is essential to define a process for how data will be cleaned that includes both the steps involved and the cadence at which data needs to be processed. Precisely describing your company’s data cleansing process helps with:
- Identifying opportunities to make the process more efficient.
- Maintaining data quality in the event of an organizational transition.
- Automating repetitive tasks to either reduce the level of effort required for human workers or eliminate their involvement entirely.
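One way to picture a formalized process is as an ordered list of named steps that can be reviewed, handed over, and automated. The Python sketch below is only an illustration under assumed rules: the three steps, the required "email" field, and the sample rows are hypothetical, not a prescribed standard.

```python
def trim_whitespace(rows):
    """Strip leading and trailing spaces from every string value."""
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
    ]


def drop_incomplete(rows, required=("email",)):
    """Remove rows missing any required field (empty strings count as missing)."""
    return [row for row in rows if all(row.get(f) for f in required)]


def drop_duplicates(rows, key="email"):
    """Keep the first row for each key value, ignoring letter case."""
    seen = set()
    unique = []
    for row in rows:
        value = row[key].lower()
        if value not in seen:
            seen.add(value)
            unique.append(row)
    return unique


# The documented process: the order of steps is part of the definition.
CLEANING_STEPS = [trim_whitespace, drop_incomplete, drop_duplicates]


def run_cleaning(rows):
    """Run every step in order and record row counts before and after each."""
    report = []
    for step in CLEANING_STEPS:
        before = len(rows)
        rows = step(rows)
        report.append((step.__name__, before, len(rows)))
    return rows, report


# Hypothetical raw data with a duplicate and an incomplete row.
raw = [
    {"email": " ana@example.com ", "name": "Ana"},
    {"email": "ANA@example.com", "name": "Ana R."},
    {"email": "", "name": "No Email"},
    {"email": "li@example.com", "name": "Li"},
]
clean, report = run_cleaning(raw)
```

Because each step is a named function and the run produces a report, the process stays legible to whoever inherits it, and the same script can be scheduled at whatever cadence the team agrees on.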
Leverage technology to help maintain data quality
As evidenced by steps one and two, there is no plug-and-play tech solution for data quality governance. People, whether we like it or not, are still very much involved in this process. However, data management software can simplify the oversight required for maintaining large quantities of data. Features like multi-user access control, which enables multiple users to access a database simultaneously without risking the integrity of the entire dataset, help keep data clean and secure without restricting access to select employees.
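Access control features differ from product to product, but the underlying idea can be sketched as a lookup from roles to permitted actions. The role names and actions below are hypothetical and do not describe any particular data management tool.

```python
# Hypothetical roles and the actions each one may perform on a dataset.
PERMISSIONS = {
    "viewer": {"read"},
    "editor": {"read", "update"},
    "steward": {"read", "update", "delete"},
}


def is_allowed(role, action):
    """Return True if the role may perform the action; unknown roles get nothing."""
    return action in PERMISSIONS.get(role, set())


can_view = is_allowed("viewer", "read")      # viewers may read
can_purge = is_allowed("viewer", "delete")   # viewers may not delete
```

In this sketch everyone keeps read access, while destructive actions are limited to the role that owns data quality.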
Ultimately, proper data governance requires identifying the people, processes and technology that create and control data assets throughout your entire organization. While this may seem like an insurmountable task depending on the size of your business, we have a few tips to help facilitate the process.
How to develop a data governance model
Identifying all of the people, processes and technology that are involved in managing your data is no minor undertaking. However, Gartner research (available to clients) outlines a few key considerations for developing a data governance model that simplifies things:
Organizational structure and data silos
One of the first considerations any business has to make when forming a system for data governance is how and where different departments store their data. Companies with departmental silos aiming to centralize their data assets face an uphill battle compared with businesses that keep their data clean and standardized.
If data silos are necessary, a process must exist for cross-departmental data sharing. By looking at data through the lens of your organizational structure, you’ll be able to anticipate the concerns or requirements various departments will have, and decide how your data governance plan will accommodate them.
State of company data
Depending on the industry and specific operations of your company, there may be certain compliance requirements your organization needs to consider that impact data governance. Industries that handle confidential information (e.g. healthcare, insurance) may need to restrict employee access to data or meet specific physical, network and process security requirements.
A successful data governance plan needs buy-in from all employees who will play a role in managing data. For example, if your organization is structured such that IT involvement is required to process certain types of data, a successful data governance plan will require cross-departmental buy-in from leadership.
Additionally, maintaining data quality depends on individual contributions from people throughout your company. Employees will be far more likely to comply with data governance rules if they understand the value of the plan and believe it will work.
3 different ways to approach data governance
While governance sounds severe, it is meant to be a flexible tool that optimizes for high-quality data that is stored efficiently and securely. Below are a few approaches that provide a framework for building your own data governance model:
1. Siloed data governance
In a siloed environment data is intentionally kept separate between departments, with collaboration and data sharing across functions only happening when necessary. Department liaisons are typically required to bridge the information divide between departments, taking responsibility for communicating data needs and mitigating the downsides of data silos.
Most useful for: Companies where it is a struggle to get cross-departmental employee buy-in, individual departments own the data they need access to, and the goals of governance do not require merging datasets across departments.
2. Coordinated data governance
Taking a coordinated data governance approach also requires department leaders to act as liaisons. However, a formal mechanism exists to identify and accelerate cross-departmental collaboration where it makes sense. Representatives from throughout your organization should serve on a governance board that sets strategic goals, delegates responsibilities, and forms consensus around decision making.
Most useful for: Companies that know some datasets need to be merged (e.g. customer data spread across sales, service and marketing), but don’t want to permanently remove departmental barriers that protect data integrity and security.
3. Stand-alone data governance
A stand-alone data governance team is created to handle all governance activities. Employees outside of the stand-alone governance team will have limited access to company data assets and request information on an as-needed basis. Data governance initiatives that are centered around meeting compliance requirements should consider a stand-alone approach.
Most useful for: Companies that operate in industries that handle confidential information or that view their data as a highly protected company asset.