The AI Playbook

Chapter 3: Data

Summary

  • For effective AI, develop a data strategy. A data strategy spans: data acquisition & processing; quality; context; storage; provisioning; and management & security. Define your data strategy at the outset of your AI initiative.
  • Accelerate data acquisition by using multiple sources. Developers draw on several sources including: free resources (such as dataset aggregators); partnerships with third parties (companies, universities, data providers and government departments); and the creation of new, proprietary data.
  • A high-quality data set has appropriate characteristics to address your business challenge, minimises bias and offers training data labelled with a high degree of accuracy. Develop a balanced data set – if you possess significantly more samples of one type of output than another, your system will exhibit bias.
  • Primary forms of bias are: unwarranted correlations (between inputs and output classifications); erroneous assumptions which cause relationships to be missed (‘underfitting’); and modelling noise instead of valid outputs (‘overfitting’). Adjust for overfitting and underfitting by using different data volumes and model structures. Remove unwarranted correlations through testing.
  • Ensure that the results of your internal testing will be maintained when applied to real-world data. Test early, and frequently, on expected live data.
  • Managing ‘dirty data’ is data scientists’ most significant challenge (Kaggle). Smaller volumes of relevant, well-labelled data will typically enable better model accuracy than large volumes of poor-quality data. To label data effectively: consider developing a supporting system to accelerate data labelling and improve accuracy; draw on existing AI and data techniques; and seek data labelled by multiple individuals to mitigate mislabelling.
  • Understand the data you use. Ensure you capture the human knowledge regarding how your data was gathered, so you can make downstream decisions regarding its use. Capture data provenance (where your data originated and how it was collected). Define your variables (differentiate between raw data, merged data, labels and inferences). Understand the systems and mappings through which your data pass to retain detail.
  • Store and structure data optimally to support your objectives. Storage options include basic file-based, relational, NoSQL or a combination. When selecting storage, plan for growth in data volume, updates, resilience and recoverability.
  • One in three data scientists report that access to data is a primary inhibitor of productivity (Kaggle). Develop a provisioning strategy that: ensures data is accessible across your organisation when needed; contains safeguards to protect your company against accidents; optimises system input/output; and maintains data freshness.
  • Implement robust data management and security procedures consistent with local and global regulations. Personal data is protected by UK and EU law and you must store it securely. Draw on principles of appropriate storage, transmission and minimum required access.

Data: The Checklist

Formulate a data strategy

  • Develop a data strategy
  • Review your data strategy quarterly

Optimise acquisition & processing

  • Ensure your data collection is legal
  • Preserve detailed fields
  • Check you have included real world data

Develop a high-quality data set

  • Confirm you have enough examples of data classes for fair predictions
  • Understand variance in data required to solve your business challenge
  • Identify sources of bias in your data
  • Follow best practices for labelling data

Understand data context

  • Document the sources of your data
  • Add metadata to capture data collection methods

Store data optimally

  • Forecast expected growth in data
  • Evaluate methods of storage and access
  • Develop and test a resilience plan

Provision data appropriately

  • Ensure data requests do not block the addition of new data
  • Develop a plan to archive stale data so access remains fast

Optimise management & security

  • Ensure staff have the minimum access they require to perform their role
  • Use multi-factor authentication
  • Undertake regular penetration tests to validate your security
  • Appoint an individual responsible for compliance with legislation

A data strategy will enable your company to acquire, process, govern and gain value from data effectively. Without a data strategy, your team’s efforts will be greater than necessary, risks will be magnified and chances of success will be reduced.

Develop a data strategy for effective AI

“Data is the lifeblood of any AI system. Without it, nothing happens” (David Benigson, Signal). There are six components of an effective data strategy (Fig. 9.):

  1. Acquisition & Processing: Obtain and process the data you need to develop effective prototypes and algorithms.
  2. Quality: Develop a data set that has the appropriate characteristics to address your business challenge, minimises bias and offers training data labelled with a high degree of accuracy.
  3. Context: Understand the provenance of your data and the mappings through which it passes so you use and share it effectively within your company.
  4. Storage: Store and structure your data appropriately to support your objectives regarding access, speed, resilience and compliance.
  5. Provisioning: Optimise the accessibility of data to your team and the implementation of safeguards.
  6. Management & Security: Manage data security, access and permissioning to ensure appropriate use of your data stores.

Define your data strategy at the outset of your AI initiative. Review it quarterly and update it as product requirements change, your company grows or you are impacted by new legislation.

Fig. 9. The six components of an effective data strategy

Source: MMC Ventures

“Data is the lifeblood of any AI system. Without it, nothing happens.”

David Benigson, Signal

Accelerate data acquisition by using multiple sources

Obtaining data to develop a prototype or train your models can be a lengthy process. Ideally, you will possess all the data you need at the outset and have a data strategy to govern its access and management. In the real world, neither is likely. Working on the project may highlight missing data.

“Build access to data at scale from day one” (David Benigson, Signal). Filling the gaps from your own initiatives can take months, so use multiple approaches to accelerate progress. Developers typically draw on several approaches to source data (Fig. 10) including free resources (such as dataset aggregators), partnerships with third parties and the creation of new, proprietary data.

  • Use free resources: Evaluate data sources that already exist and are free to use. Kaggle (www.kaggle.com), a large community of data scientists and machine learning engineers, regularly posts data sources for competition experiments. These can be useful for prototyping and initial training of machine learning algorithms. Google Dataset Search (https://toolbox.google.com/datasetsearch) can help you find specific data sets – be they weather in London or public transport statistics for Manchester. Further, many authors of academic papers are now uploading sample code and data sets (either raw data or locations to acquire it) to platforms such as GitHub. These data sets are frequently used for benchmarking. Not all of the datasets from the above sources are free for business use, so check that your use of them is appropriate.
  • Develop partnerships: Develop partnerships with other organisations – other companies, universities, data providers or government departments. Establishing a mutually beneficial relationship can offer your company exclusive data and associated benefits.
  • Create data: The data you seek may be unavailable or prohibitively costly. You may need to invest time and resource to create the data you need – and a quarter of data scientists do so. The approach – embedding sensors, taking photos or videos, undertaking surveys or labelling existing datasets – will vary according to your industry and use case. Proprietary data is valuable – which is why so little is free. Developing your repository of proprietary data will yield value and defensibility over time.

You will need to de-duplicate and merge your data from multiple sources into a single, consistent store. New data must follow a comparable process so your data remains clean. If you merge fields, or decrease the precision of your data, retain the original data. Being able to analyse gaps in your data will enable you to plan future data acquisition and prioritise addressable business use cases.
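The merge itself need not be complex. Below is a minimal sketch in Python using pandas; the file names, column names and the choice of customer_id as the join key are illustrative assumptions, not a prescription.

    import pandas as pd

    # Two hypothetical source extracts; file names and columns are illustrative.
    crm = pd.read_csv("crm_export.csv")        # e.g. customer_id, name, email
    survey = pd.read_csv("survey_export.csv")  # e.g. customer_id, response

    # Normalise obvious inconsistencies before merging.
    crm["email"] = crm["email"].str.strip().str.lower()

    # De-duplicate each source on its natural key before combining.
    crm = crm.drop_duplicates(subset="customer_id", keep="last")
    survey = survey.drop_duplicates(subset="customer_id", keep="last")

    # Merge into a single, consistent store; keep all CRM rows so gaps stay visible.
    combined = crm.merge(survey, on="customer_id", how="left", validate="one_to_one")

    # Rows with missing survey fields highlight gaps for future data acquisition.
    gaps = combined[combined["response"].isna()]
    print(f"{len(gaps)} customers lack survey data")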

“Build access to data at scale from day one.”

David Benigson, Signal
Fig. 10. AI developers use multiple approaches to source training data

Source: Kaggle

Develop a balanced, well-labelled data set

A high-quality data set has appropriate characteristics to address your business challenge, minimises bias and offers training data labelled with a high degree of accuracy.

It is important to develop a balanced data set. If you possess significantly more samples of one type of output than another, your AI is likely to exhibit bias. You can decide whether your system’s bias will tend towards false positives or false negatives, but bias will be inevitable. There are three primary forms of bias in AI (Fig. 11):

Fig. 11. Three types of bias in AI

Source: Victor Lavrenko

  1. Unwarranted correlations between inputs and output classification. Systems that offer jobs based on gender rather than skills, or provide or decline financial products based on ethnicity, are examples of unwarranted correlations resulting from unrepresentative input data.
  2. Erroneous assumptions in learning algorithms, which cause relevant relationships to be missed – so-called ‘underfitting’ (Fig. 12, overleaf). An underfitted model does not make full use of the power of your data. If you seek to predict rental prices for properties, and base your model only on the number of bedrooms each property has, your predictions will perform poorly; your model will ignore important characteristics such as location, whether a property is furnished, and whether it offers parking or a garden.
  3. Modelling noise instead of valid outputs – so-called ‘overfitting’. An overfitted model takes account of so many details in the data that it cannot make accurate predictions. A model built on all the health-related data of a group of people, for example, will absorb so much natural variation in weights, blood pressures and general levels of fitness that predicting the characteristics of a new member of the group would be inaccurate.

Be aware of bias in your data and models so you can take appropriate action and minimise its impact. Overfitting and underfitting can be addressed by adjusting data volumes and model structures. Unwarranted correlations are frequently more critical to the business; in addition to producing erroneous results, they can lead to negative publicity. Test models thoroughly to ensure that variables that should not affect predictions do not do so. Where possible, exclude these ‘protected variables’ from your models completely.
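One practical diagnostic is to compare training and validation performance as model complexity grows: weak scores on both sets suggest underfitting, while a strong training score paired with a much weaker validation score suggests overfitting. A minimal sketch using scikit-learn; the synthetic data and decision-tree model are illustrative choices.

    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    # Synthetic data standing in for, say, rental-price records.
    X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    for depth in (1, 3, 10, None):  # None lets the tree grow until leaves are pure
        model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
        # Low scores on both sets suggest underfitting; a high training score with a
        # much lower validation score suggests overfitting.
        print(f"max_depth={depth}: train R2={model.score(X_train, y_train):.2f}, "
              f"validation R2={model.score(X_val, y_val):.2f}")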

“If you possess significantly more samples of one type of output than another, your AI system is likely to exhibit bias.”

If the features you seek are rare, it can be challenging to achieve a balanced data set. You want a model that handles rare occurrences effectively without being overfitted. You may be able to use artificial data, but not if artefacts in the artificial data themselves distort the model. You may also choose to retain some overfit or underfit bias – and opt for a greater proportion of false positives or false negatives. If you err on the side of false positives, one solution is to let a human check the results. The bias you prefer – false positives or false negatives – is likely to depend on your domain. If your system is designed to recognise company logos, missing some classifications may be less problematic than incorrectly identifying others. If identifying cancerous cells in a scan, missing some classifications may be much more problematic than erroneously highlighting areas of concern.
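Where one class is rare, a common mitigation is to weight classes during training and then move the decision threshold to favour whichever error type your domain tolerates better. A minimal sketch with scikit-learn; the synthetic data and the 0.3 threshold are illustrative assumptions.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Imbalanced synthetic data: roughly 5% positive class (a rare occurrence).
    X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # 'balanced' reweights classes inversely to their frequency in the training data.
    model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

    # Lowering the threshold below 0.5 favours false positives over false negatives;
    # appropriate where a missed case is costlier than a human double-check.
    probabilities = model.predict_proba(X_test)[:, 1]
    predictions = (probabilities >= 0.3).astype(int)

    false_negatives = int(np.sum((predictions == 0) & (y_test == 1)))
    false_positives = int(np.sum((predictions == 1) & (y_test == 0)))
    print(f"false negatives: {false_negatives}, false positives: {false_positives}")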

Fig. 12. The problem of overfitting

Source: XKCD

It is critical to ensure that the results of your internal testing are maintained when applied to real-world data. 99% accuracy on an internal test is of little value if accuracy falls to 20% when your model is in production. Test early, and frequently, on real-world data. “If you don’t look at real-world data early then you’ll never get something that works in production” (Dr Janet Bastiman, Chief Science Officer, StoryStream). Before you build your model, put aside a ‘test set’ of data that you can guarantee has never been included in the training of your AI system. Most training routines randomly select a percentage of your data to set aside for testing, but over multiple iterations the remaining data can become incorporated in your training set. A test set that you are sure has never been used can be reused for every new candidate release. “When we’re looking at images of vehicles, I get the whole company involved. We all go out and take pictures on our phones and save these as our internal test set – so we can be sure they’ve never been in any of the sources we’ve used for training” (Dr Janet Bastiman, Chief Science Officer, StoryStream). Ensure, further, that your ‘test set’ data does not become stale. It should always be representative of the real-world data you are analysing. Update it regularly, and every time you see ‘edge cases’ or examples that your system misclassifies, add them to the test set to enable improvement.
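A simple discipline is to carve out the test set once, store it separately and never let it touch the training pipeline. A minimal sketch with pandas; the file names and the 10% split are illustrative assumptions.

    import pandas as pd

    # Load the full labelled dataset once.
    data = pd.read_csv("labelled_data.csv")

    # Set aside a fixed test set; the fixed random_state makes the split reproducible.
    test_set = data.sample(frac=0.1, random_state=42)
    training_pool = data.drop(test_set.index)

    # Persist the partitions separately: only training_pool feeds training and
    # cross-validation, while test_set is reserved for evaluating candidate releases.
    test_set.to_csv("holdout_test_set.csv", index=False)
    training_pool.to_csv("training_pool.csv", index=False)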

Data scientists report that managing ‘dirty data’ is the most significant challenge they face (Kaggle). Smaller volumes of relevant, well-labelled data will typically enable better model accuracy than large volumes of poor-quality data. Ideally, your AI team would be gifted data that is exhaustively labelled with 100% accuracy. In reality, data is typically unlabelled, sparsely labelled or labelled incorrectly. Even human-labelled data can be poorly labelled. Data labelling is frequently crowdsourced and undertaken by non-experts. In some contexts, labelling may also be intrinsically subjective. Further, individuals looking at large volumes of data may experience visual saturation, missing elements that are present or seeing artefacts that are not. To mitigate these challenges, companies frequently seek data labelled by multiple individuals, taking a consensus or average across labellers.
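Taking a consensus can be as simple as a majority vote per item, with ties routed to an expert. A minimal sketch with pandas, assuming each item has been labelled by three people; the data is illustrative.

    import pandas as pd

    # One row per (item, labeller) pair; values are illustrative.
    labels = pd.DataFrame({
        "item_id": [1, 1, 1, 2, 2, 2],
        "labeller": ["a", "b", "c", "a", "b", "c"],
        "label":    ["car", "car", "van", "bus", "bus", "bus"],
    })

    # Majority vote per item; items without a clear majority are flagged for review.
    def majority_vote(votes: pd.Series) -> str:
        counts = votes.value_counts()
        return counts.index[0] if counts.iloc[0] > len(votes) / 2 else "NEEDS_REVIEW"

    consensus = labels.groupby("item_id")["label"].apply(majority_vote)
    print(consensus)  # item 1 -> car, item 2 -> bus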

To label data effectively, consider the problem you are solving. ‘Identify the item of clothing in this image’, ‘identify the item of clothing in this image and locate its position’ and ‘extract the item of clothing described in this text’ each require different labelling tools. Depending upon the expertise of your data labelling team, you may need a supporting system to accelerate data labelling and maximise its accuracy. Do you wish to limit the team’s labelling options or provide a free choice? Will they locate words, numbers or objects and should they have a highlighter tool to do so?

Embrace existing AI and data techniques to ease the data labelling process:

  • For visual classification use a generic object-recognition model (for example, one pre-trained on ImageNet) to identify relevant categories of images (such as cars) and the location of the object in an image. You can then show your labellers the image with a highlighted area and ask about the highlighted object to make a deeper classification (such as model).
  • For natural language processing you may be able to draw on existing textual content and classifiers, such as sentiment analysers, to sort data into broad categories that a person can verify and build upon for downstream applications.
  • Use clustering techniques to group large volumes of similar data that can be labelled together (see the sketch after this list).
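For the clustering route, here is a minimal sketch with scikit-learn; the random feature matrix and the cluster count are illustrative, and in practice the features might be embeddings from a pre-trained model.

    import numpy as np
    from sklearn.cluster import KMeans

    # Feature vectors for unlabelled items (e.g. image embeddings); random here for illustration.
    rng = np.random.default_rng(0)
    features = rng.normal(size=(1000, 64))

    # Group similar items so a labeller can confirm or correct a whole cluster at once.
    clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(features)

    for cluster_id in range(3):
        members = np.where(clusters == cluster_id)[0]
        print(f"cluster {cluster_id}: {len(members)} items to review together")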

“If you don’t look at real-world data early then you’ll never get something that works in production.”

Dr Janet Bastiman, StoryStream

Understand data context by capturing human knowledge

It is critical to understand the data you use. A number labelled ‘score’ in your database is difficult to work with – and may be impossible to use – if you do not know how it was derived. Ensure you capture the human knowledge of how data was gathered, so you can make sound downstream decisions regarding data use.

Your data strategy should ensure you:

  • Understand data provenance: It is imperative to understand where your data originated, how it was collected and the limitations of the collection process. Does data relate to current customers only, or a spread of the population? Are the images or audio you use raw or have they already been digitally edited?
  • Define your variables: Defined variables should enable you to differentiate between raw data, merged data, labels and inferences (such as assuming an individual’s gender from their title).
  • Understand systems and mappings through which data have passed. As you process data through multiple systems and mappings, problems can arise – much as photocopies of a photocopy begin to degrade. For example, if a date of birth field is imported into a system that requires age instead, the mapping will be accurate at the time of processing, but information has been lost and the quality of the data will degrade over time. If this is then mapped to a system that uses an age range, accuracy will be regained but at the expense of precision. Ensure your mappings retain detail (a short sketch of this loss follows the list).
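The degradation described above is easy to demonstrate. A minimal sketch of the date-of-birth example; the field values and ten-year bands are illustrative assumptions.

    from datetime import date

    date_of_birth = date(1985, 6, 15)  # original field: precise and always current

    # Mapping 1: date of birth to age. Accurate when computed, but the stored value
    # goes stale over time unless it is recomputed from the original field.
    age = (date.today() - date_of_birth).days // 365  # approximate whole years

    # Mapping 2: age to age range. No longer goes stale, but precision is lost;
    # the original age cannot be recovered from the band.
    lower = (age // 10) * 10
    age_range = f"{lower}-{lower + 9}"

    print(date_of_birth, age, age_range)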

“Ensure you capture the human knowledge of how data was gathered, so you can make sound downstream decisions regarding data use.”

Understanding the context of your data will depend upon process and documentation more than tooling. Without an understanding of the context in which data was collected, you may be missing nuances and introducing unintended bias. If you are predicting sales of a new soft drink, for example, and combine existing customer feedback with data from a survey you commission, you must ensure you understand how the survey was conducted. Does it reflect the views of a random sample, people in the soft drinks aisle, or people selecting similar drinks?

It is important to understand the information not explicitly expressed in the data you use. Documenting this information will improve your understanding of results when you test your models. Investigating data context should prompt your employees to ask questions – and benefit from their differing perspectives. If you lack diversity in your team, you may lack perspectives you need to identify shortcomings in your data collection methodology. Ensure team members deeply understand your company’s domain as well as its data. Without deeper knowledge of your domain, it can be challenging to know what variables to input to your system and results may be impaired. If predicting sales of computer games, for example, it may be important to consider controversy, uniqueness and strength of fan base in addition to conventional variables.

Store and structure data optimally to support your objectives

Your data storage strategy will impact the usability and performance of your data. The nature of your data, its rate of growth and accessibility requirements should inform your approach.

Types of storage include basic file-based, relational and NoSQL (‘not only SQL’):

  • Basic file-based: Whether a cloud-based solution – such as Amazon Web Services (AWS) or HotBlob – or in-house, basic file-based storage has no limitations on file size – but is slow to search and search requests are typically based simply on file name, size or creation date.
  • Relational: Relational databases (such as MySQL or Oracle) can store extensive information in separate tables related to one another. Relational databases are well suited to well-defined information, with strict schemas, that can be grouped into tables. While powerful in their ability to enable complex queries, and offering security down to the field level, relational databases can struggle with large data items (including images and documents) and prove challenging to scale.
  • NoSQL: Recently, NoSQL databases (such as Mongo or Redis) have become popular because they do not demand the field restrictions associated with relational databases. NoSQL databases are effective for storing large volumes of hierarchical data. Accordingly, they are commonly associated with ‘big data’ initiatives. NoSQL databases can easily be scaled by adding extra machines to your system (‘horizontal scaling’), but struggle to enable complex queries due to the way in which they store data.

The store you select will influence the performance and scalability of your system. Consider mixing and matching to meet your needs – for example, a relational database of individuals with sensitive information linking to data stored in a more accessible NoSQL database. The specific configuration you choose should depend upon the data types you will store and how you intend to interrogate your data.

To plan for growth and updates:
  • Forecast increases in data volume. If starting with existing data, you will understand current data volumes and how much new data you are adding each day. If starting from scratch, you will need to estimate data growth based on a forecast of incoming data. Armed with an estimate of data growth, you can determine the volume of storage you will require for the first year of your project (a worked example follows this list).
  • Cloud solutions will enable you to store as much data as you wish – but balance the cost of immediate- and long-term storage (on AWS, the difference between S3 and Glacier). If operating your own hardware, you will also need to decide whether to archive data away from your primary store. You may need to maintain physically separate data stores for select personal data, to ensure its isolation.
  • Monitor costs, remaining storage, and system performance so you can act before costs become prohibitive or you run out of storage space. For relational databases this is critical, because scaling is likely to require you to upgrade the hardware on which your database is operating. For NoSQL systems, it will be easier to scale horizontally.
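A rough volume forecast can be a few lines of arithmetic. A minimal worked example; every figure below is an illustrative assumption to be replaced with your own measurements.

    # Illustrative assumptions; adjust to your own measurements.
    current_volume_gb = 500       # data held today
    daily_growth_gb = 2.5         # new data added per day
    growth_factor_year_two = 1.2  # expected 20% acceleration in the daily rate

    year_one_gb = current_volume_gb + daily_growth_gb * 365
    year_two_gb = year_one_gb + daily_growth_gb * growth_factor_year_two * 365

    print(f"storage needed after year one: {year_one_gb:,.0f} GB")
    print(f"storage needed after year two: {year_two_gb:,.0f} GB")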
For resilience and recoverability:
  • Treat resilience as mission-critical. Data is the most valuable component of your AI strategy; if your data were lost, you could not rebuild your models and would lose a significant proportion of your company’s uniqueness and value.
  • While large companies will have dedicated resources and specialist skills, startups and scale-ups must also plan for resilience and recoverability.
  • Ensure regular backups. Storage is inexpensive and accessible to every company.
  • The degree of resilience you require will depend upon whether it is critical for your data store to be permanently available for read and write. Resilient systems will duplicate your data, so a replica can take over seamlessly if part of your system fails. Further, resilient systems typically load balance to ensure multiple requests do not cause delays.
  • Many cloud providers offer resilient systems as part of their service. While most data centres have their own generators and redundant internet connectivity, significant events such as hurricanes and earthquakes can cause hours, or even days, of disruption. Other risks, including cascading software failures, can also crystallise. Depending upon the criticality of your data access you may also seek a separate provider, with a backup, that you can invoke in the event of a major disaster. If you manage your own data storage, you must manage recoverability as a minimum. Store backups in a separate geographic location and regularly test that you can restore them successfully. Your first disaster is not the time to learn that your backups have been failing silently.

“Your data storage strategy will impact the usability and performance of your data.”

When provisioning data consider access, safeguards and data freshness

One in three data scientists report that access to data is a primary inhibitor of productivity (Kaggle). Data provisioning – making data accessible to employees who need it in an orderly and secure fashion – should be a key component of your data strategy. While best practices vary according to circumstance, consider:

  • Access: Your data science team will become frustrated if they are waiting for another team to provide them with data; providing them with tools for direct access may be valuable. Most data stores offer only full administrative access or expert-level tooling, so you may need to allow time and resource to implement a specific solution for your team.
  • Safeguards: Protect your company against accidents. Ensure data access is read-only. Except for an administrator, no-one should be able to delete or change data.
  • Input/output: Reading data from your systems must not block the addition of new data. Similarly, if your data store is being continually updated, your team should not have to wait for a significant period before they can extract the data they require.

Stale data can be a significant challenge and is a key consideration when planning your provisioning strategy. If you are analysing rapidly-changing information, decide how much historical data is relevant. You might include all data, a specific volume of data points, or data from a moving window of time. Select an approach appropriate for the problem you are solving. Your strategy may evolve as your solution matures.

If you are correlating actions to time, consider carefully the window for your time series. If you are predicting stock levels, a few months of data will fail to capture seasonal variation. Conversely, if attempting to predict whether an individual’s vital signs are deteriorating, to enable rapid intervention, an individual’s blood pressure last month is likely to be less relevant. Understand whether periodic effects can impact your system and ensure that your models and predictions are based on several cycles of the typical period you are modelling. Pragmatically, ensure your access scripts consider the recency of data you require to minimise ongoing effort.
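Access scripts can enforce the window explicitly. A minimal sketch with pandas, assuming a 'timestamp' column and a 24-month window; the file and the window length are illustrative.

    import pandas as pd

    # Load records with a timestamp column (path and column name are illustrative).
    records = pd.read_csv("observations.csv", parse_dates=["timestamp"])

    # Keep a moving window: only records from the last 24 months, enough to cover
    # roughly two cycles of an annual seasonal pattern.
    window_start = pd.Timestamp.now() - pd.DateOffset(months=24)
    recent = records[records["timestamp"] >= window_start]

    print(f"{len(recent)} of {len(records)} records fall within the 24-month window")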

Implement robust data management and security procedures

Data management and security are critical components of a data strategy. Personal data is protected by UK and EU law and you must store it securely.

You may need to encrypt data at rest, as well as when transmitting data between systems. It may be beneficial to separate personal data from your primary data store, so you can apply a higher level of security to it without impacting your team’s access to other data. Note, however, that personal data included in your models, or the inference of protected data through your systems, will fall under data protection legislation.
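For field-level encryption of personal data, established symmetric-encryption libraries are widely available. A minimal sketch using the Fernet interface from the Python 'cryptography' package; key management (ideally via a dedicated secrets service) is deliberately out of scope here.

    from cryptography.fernet import Fernet

    # In production the key would come from a secrets manager, never from source code.
    key = Fernet.generate_key()
    cipher = Fernet(key)

    # Encrypt a sensitive field before writing it to the data store...
    email_encrypted = cipher.encrypt(b"jane.doe@example.com")

    # ...and decrypt it only in the narrow contexts that genuinely need the raw value.
    print(cipher.decrypt(email_encrypted).decode())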

Establish effective data management by building upon the principles of appropriate storage and minimum required access.

  • Physical Access: Direct access to your data store should be tightly limited to key, trusted individuals. Individuals with the highest level of access to your systems will frequently be targets for malicious third parties.
  • Users: Employees’ needs regarding data access will vary. If individuals do not need to view sensitive data, they should not have the ability to view or extract it.
  • Applications: Other systems that connect to your data store should also be treated as virtual users and restricted. Many companies fail to restrict application access and suffer adverse consequences when there is an error in a connected application or the application’s access credentials are compromised.

Additionally:

  • Use multi-factor authentication as broadly as possible.
  • Log every access request with the identity of the requester and the details of the data extracted.
  • Hire a third party to undertake penetration testing to validate the security of your systems.

If an individual resigns, or has their employment terminated, immediately revoke access to all sensitive systems including your data. Ensure that employees who leave cannot retain a copy of your data. Data scientists are more likely to try to retain data to finish a problem on which they have been working, or because of their affinity for the data, than for industrial espionage. Neither is an appropriate reason, however, and both contravene data protection law. Ensure your team is aware of the law and that you have appropriate policies in place.