Data Lake: Building a Bridge between Technology and Business
By Shaun Rankin, Former SVP of Data Management, Citizens Bank
A production environment, where the business operates to meet the needs of its customers, requires stability and controls to support the business function for which it is designed. This level of stability helps support teams succeed in meeting their required service level agreements.
A data lake supporting analytics is an organic, iterative environment for testing hypotheses. Once a hypothesis is proven (the "EUREKA" moment), that insight needs to be shared. Analysts can also bring their own data and mix it with the production data. The needs change on a daily basis; the process cannot be defined to the level required when designing a stable production application.
Establishing self-service analytics with sandboxes in the production environment can help bridge the gap between the traditional production data environment and the organic data blending required to support analytics. Allow analysts to load data into a dedicated workspace that can be joined to the production data in the data lake. Self-service capabilities allow for the capture, blending, and sharing of information without running a software development project. Some mature organizations have implemented business-driven schedules for implementing repeatable processes.
Controls can be put in place on the sandboxes (90-day expiration, charge-back methods, etc.) to help establish outer boundaries for compute, storage, and network traffic.
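As a sketch of what such sandbox controls might look like in code, the snippet below models a hypothetical sandbox policy with a 90-day expiration and a charge-back code. The class name, fields, and limits are all illustrative assumptions, not part of any particular product:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class SandboxPolicy:
    """Hypothetical outer-boundary controls for an analyst sandbox."""
    owner: str
    created: date
    max_storage_gb: int = 500          # illustrative storage ceiling
    ttl_days: int = 90                 # auto-expiration window from the article
    chargeback_code: str = "UNALLOCATED"

    @property
    def expires(self) -> date:
        return self.created + timedelta(days=self.ttl_days)

    def is_expired(self, today: date) -> bool:
        """True once the sandbox has passed its expiration date."""
        return today >= self.expires

# A sandbox created on Jan 1, 2024 expires 90 days later
sb = SandboxPolicy(owner="analyst01", created=date(2024, 1, 1))
print(sb.expires)                       # 2024-03-31
print(sb.is_expired(date(2024, 4, 1)))  # True
```

A nightly job could sweep policies like this to reclaim storage and report usage against each charge-back code.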
Controlling access to data tends to occur within technology, sometimes as part of a project, following information security guidelines. This tends to produce access control designed from an application or source perspective. That model falls down in several key areas:
• Technology designs access within a project and may not have a deep enough understanding of business utilization to appropriately assess the inherent risk. This drives tighter controls than are necessary when balanced against the actual level of risk.
• Justification processes are performed on an individual basis, which leads to inconsistent entitlements.
• A multitude of sources. Data lakes need information from many sources, and if each source has a different provisioning process, provisioning in the lake becomes complicated.
One concept for addressing this is to ensure the appropriate first line of defense is defined. For most companies, the first line of defense is the business or department directly, not IT. This reduces, but does not eliminate, the role technology plays, and it enables a more balanced approach to risk and control. It also elevates the business's accountability for defining the right level of risk.
A second opportunity is to change the design from where data is sourced to where it is consumed, commonly known as role-based provisioning. Aligning access control with a specific department ties data needs more closely to the relevant policies and justifications. Role-based design also eliminates the complexity of provisioning data from multiple sources.
Practical Application: A role called FINANCE can be provisioned with all GL, transaction, and account-related information. Most roles in finance share a consistent business need to know, and the applicable policies (e.g., Sarbanes-Oxley) apply to anyone holding the role. A second role called MARKETING, on the other hand, could require masked customer data, with the relevant Privacy or GLBA policies implemented with that role.
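The FINANCE/MARKETING example above can be sketched as a simple role-to-entitlement map. The role names follow the article; the dataset names, masking flag, and lookup function are hypothetical illustrations, not a real provisioning API:

```python
# Hypothetical role-to-entitlement map: each role bundles the datasets it may
# see, whether customer PII is masked, and the policies that apply to holders.
ROLE_ENTITLEMENTS = {
    "FINANCE": {
        "datasets": ["gl", "transactions", "accounts"],
        "mask_customer_pii": False,
        "policies": ["Sarbanes-Oxley"],
    },
    "MARKETING": {
        "datasets": ["campaigns", "customers"],
        "mask_customer_pii": True,      # marketing sees masked customer data
        "policies": ["Privacy", "GLBA"],
    },
}

def provision(role: str) -> dict:
    """Return the entitlement profile a new member of `role` receives."""
    try:
        return ROLE_ENTITLEMENTS[role]
    except KeyError:
        raise ValueError(f"No entitlement profile defined for role {role!r}")

grants = provision("MARKETING")
print(grants["mask_customer_pii"])  # True
```

Because entitlements hang off the role rather than off each source system, adding a new source only means updating the role profiles, not re-justifying every individual's access.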
One Size Does Not Fit All
Each use of data has a unique context; the performance, quality, ease of use, cost, and speed to decision can all vary. Regulatory reporting may require a GL-reconciled data mart with full lineage from the moment data is acquired to the moment it appears on a report. "Hurricane Katrina is coming; do we have any customers, inventory, or facilities that will be impacted?" carries a different level of urgency: the best data at hand will do.
Segmenting your data lake to support different analytic needs will be helpful in driving success of the consumer. Many analysts are used to data wrangling with data from disparate sources. Some information consumers need a more structured organization of the data assets.
Plan for a variety of tiers, perhaps starting with three: RAW, FULLY ORGANIZED, and something in between. Be prepared to allow data to be used across sandbox, raw, fully organized, and the other tiers.
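A minimal sketch of such a three-tier layout, assuming a middle "CURATED" tier between RAW and FULLY ORGANIZED; the tier attributes and the routing rule are illustrative assumptions about how a consumer's needs might map to a tier:

```python
# Hypothetical three-tier layout; names and attributes are illustrative.
TIERS = {
    "RAW": {
        "lineage": "source file only",
        "quality": "as landed",
        "audience": "analysts comfortable with data wrangling",
    },
    "CURATED": {
        "lineage": "partial",
        "quality": "profiled and standardized",
        "audience": "consumers needing structured data assets",
    },
    "FULLY_ORGANIZED": {
        "lineage": "full, GL-reconciled",
        "quality": "certified",
        "audience": "regulatory reporting",
    },
}

def resolve_tier(needs_reconciliation: bool, tolerates_raw: bool) -> str:
    """Route a consumer to the lightest tier that satisfies their needs."""
    if needs_reconciliation:
        return "FULLY_ORGANIZED"
    return "RAW" if tolerates_raw else "CURATED"

print(resolve_tier(needs_reconciliation=True, tolerates_raw=False))
# FULLY_ORGANIZED
```

The urgency example from the previous section fits this routing: a hurricane-impact question tolerates RAW, while regulatory reporting does not.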