How Jollyes runs the PetCLUB loyalty program for over 1.4 million members?
Read now
2024-09-18 12:00 am
2024-05-09 12:00 am
2024-03-18 12:00 am
2024-04-16 12:00 am
2024-04-14 12:00 am
2024-09-16 12:00 am
2024-06-25 12:00 am
2024-06-13 12:00 am
2024-06-17 12:00 am
2024-05-29 12:00 am
arrow pointing left
go to blog
Technical Aspects of Customer Segmentation Part 2 – Data Integrity
Mike Sedzielewski
Mike Sedzielewski
January 4, 2021
Share it on Twitter
Share it on Facebook
Share it on LinkedIn
Share it on Twitter
Share it on Facebook
Share it on LinkedIn

Technical Aspects of Customer Segmentation Part 2 – Data Integrity

The last article helped us understand how data synchronization enables effective segmentation. But to become a real segmentation wizard, you need one more thing — the ability to evaluate and maintain the quality of data, collectively defined as data integrity. While the best way to learn about this is through practice, wrapping your head around critical concepts will give you a head start.

Data integrity is a broad topic. Truth be told, it’s the paramount goal of the whole IT department. They are all hired to avoid drawing insights based on incorrect data. 

When we narrow it down to customer segmentation, a CRM specialist should be aware of data uniqueness, data types, and data constraints. 

Data uniqueness

Imagine a customer getting a message twice or, worse, getting a message twice each time with a different discount. A single case won’t cause much harm, but if this number scales, the campaign might be an instant failure with long-term consequences, including the loyalty drop.

This is why your customer database has to take care of data uniqueness. To do so, you need to find an identifier that lets you distinguish between two customers. It might be an email or a phone number. Although they are unique for every Tom, Dick, or Harry, they are not good candidates for an identifier. First of all, email addresses and phone numbers might change for a person. Second, you might have problems ensuring uniqueness for emails, including “.” and “+” or non-English characters.

That’s why CRMs use a unique internal identifier. It’s coined as a universally unique identifier (UUID) or globally unique identifier (GUID)

These are algorithms that take the customer’s unique attributes (like email or phone number), encrypt them, and generate a string of random characters — unique for every customer. When John becomes your customer, CRM will generate an identifier for him, like 8f14a65f-3032-42c8-a196-1cf66d11b930. 

By creating such a long identifier, you reduce the risk of generating a duplicate id to almost zero. 

Data types

The format in which you store data might be another threat or help to data integrity. Imagine running a survey and asking your audience for a favorite smartphone — you want to send an offer to every customer who chose iPhone X during the promotion. 

When you read the list of replies, you see the following records:

  • iPhone X
  • iphone X
  • i Phone10
  • IPhone X
  • iphone 10 
  • Nokia 3310
  • And so on...

When creating a segment for “iPhone X” customers, the system will show only one customer instead of five. The solution to this issue is straightforward — just change the answer field type to a predefined list instead of an open text. But if you forget about it before a roll-out, you end up with lots of manual jobs on your plate.

Recommended data types

Simple yet troublesome mistakes like this can be avoided by understanding and double-checking the data model for every customer journey step. The first step is to visit the documentation about options your CRM offers to store and process data for every entity — it’s called a data schema.

Let’s explore how a data schema looks and how you can use its features to ensure data integrity. Data types go first. 

Data types

Data types are attributes of data that tell CRM how we can use the data or what operations we can do with them. Here’s the list of types you can find in pretty every CRM tool.

Primitives – basic types

  • Boolean – represents values true or false. For example, imagine an attribute: is_first_time_customer.
  • Integer – represents a number, positive or negative, that doesn’t have a decimal point. For example, in Salesforce CRM, integers have a minimum value of -2,147,483,648 and a maximum value of 2,147,483,647.
  • Decimal (float) – a number that includes a decimal point, e.g., 3.14159.
  • Character – a single letter or any character, including numbers (collectively called alphanumeric).
  • String – stores a string of any alphanumeric characters, such as a word, a phrase, or a sentence.
  • Date – a value that indicates a particular day.
  • Datetime – a value that indicates a particular day and time.
  • Blob – (from the binary large object) a collection of binary data stored as a single object. You can think of it as a single file (image, voice recording, movie, PDF, etc.) for simplicity.

Before we move to advanced types, let’s pause for a moment to gain some intuition on how to choose the right type. You’ve probably already noticed that each data type has two characteristics:

  • what kind of values it can represent, 
  • what are the minimum and maximum values it can store.

There are two rules of thumb to both of them:

1) When it comes to value representation – the more flexibility you have, the less data automation is possible, or better, the more work in software is necessary to process data. A simple example would be zip code in the US. If it’s a number, we can use ranges to infer the state (e.g., Alabama is 35801 through 35816). That would be impossible for the string.

Another good example would be our survey. If we wanted to count iPhone X variants with the open text version, we would need to tweak our query to include all the values. Plus, we would need to maintain the query – every time we find a new variant typed by a user, the query must be updated.

2) The second rule is about the minimum and maximum values. The bigger size you set up for an attribute, the more flexible it is. Now, you might be asking why not always use the biggest option? Because the larger size needs more computer memory necessary to process data, and this costs more. It might be negligible when you have hundreds of records, but when you grow to millions, your CRM instance might be slower to respond, or you’ll hit the limit and will be forced to upgrade to a higher pricing plan.

Composite – result from combining two or more primitives

  • Array – a group of primitives of any size. It’s usually represented as a series or primitives in brackets, e.g., [1, 3, 5, 13, 5].
  • Set – a group of primitives of any size but with unique values only [1, 3, 5, 13]. 
  • Enum – an enum (from enumerator) is a data type with values that each take on exactly one of a finite set of identifiers that you specify (it’s one we should use for our survey to avoid the mess!). 
  • Object – An object is a value that contains other values, typically in fixed numbers and sequences and typically indexed by names. The elements of records are usually called fields or members. Remember JSON examples from the first part, they are objects.

Data constraints

Data types already help us maintain order in data and prevent tedious data cleanup tasks, something we can act on with machines’ help. But we can do more than that with constraints.

Data constraints define specific properties that data must comply with. A reliable CRM system ensures that constraints hold at all times — i.e., when you create a new object or modify an existing one.

Have you ever received an email title Dear {}? This might be a result of forgetting about a NOT NULL constraint on the first name attribute.

Data Integrity issue – lack of constraints

Here is how this and other typical constraints work:

  • Not null – each value must not be “NULL,” which in plain English means it cannot be empty. 
  • Unique – value(s) must be unique for each object in a database. For example, if you want to identify customers by email or phone number, you should make this field unique to avoid duplicate messages and more severe problems.
  • Primary key – value(s) must be unique for each object and not be NULL. Most CRMs implement these constraints out of the box.
  • Foreign key – value(s) must reference an existing record in another object (via its primary key or some other unique constraint). Imagine that you find a gift card in your system, but it doesn’t have owner information. You hesitate to deactivate it because perhaps one of your customers got it and will be disappointed if it fails at the checkout. Adding a foreign key between a gift card and customer objects would solve this problem because the system will not create a card without the owner assigned or removing an owner from an existing card by mistake.
  • Check – an expression that must be true for the constraint to be satisfied. This is an umbrella constraint for conditions you can apply to attributes of specific data types. The following examples should help you grasp the concept:
  • Email (string) should comply with a specific pattern (read the wiki article to see that it’s more complicated than mandatory @ in the middle).
  • Age (integer) should be greater than 13.
  • The birthdate (date) should be in March.
  • Customer creation (date) should be before October 1st, 2020.
  • The last order (DateTime) should be not earlier than noon yesterday.
  • Phone number (string) should start with a country prefix.
  • Address (object) should consist of the street, number, city, country, postal code — which can have their “check” constraints” on their own.

Data normalization

There’s another concept you should be aware of to improve your understanding of data integrity, the data duplication part to be more specific. It’s called data normalization

As an example, let’s take a loyalty program. We need to group customers in tiers to perform some analytics later on. Picture a table keeping information about when a customer joined a particular tier.

Data normalization example

Instead of storing the name of the tier, which can be as long as Dunder Mifflin Golden Tier for every person, we simply hold a number that references a table of the program tiers. 

So, instead of holding 20000 Dunder Mifflin Golden Tier, we have 20000 references like #3. When you want to change the tier name for some reason, you need to update it only in one place, and the data integrity will be maintained. 

Now, let’s go deeper into a more advanced normalization concept. Let’s assume you want to monitor how a customer moved across different tiers. We could store it in the following table:

Customer | Tier | Tier Entered Date 

Mike Scott | DM Golden Tier | 21/07/2020

Mike Scott | DM Silver Tier | 04/06/2020

Jim Halpert | DM Bronze Tier | 17/06/2020

But it’s better to create three separate objects to store this, one table with the list of tiers, one table with the list of customers, and another one to connect them both. That gives us the most freedom we can get when changing customer’s information or tier information separately, and we have fewer duplicate data, of course.

Data Normalization – suggested data format

The default CRM objects usually follow data normalization concepts. Still, if you want to create some custom objects by extending the out-of-the-box schema, you should always keep data normalization at the back of your mind. 


When you find yourself working with a new CRM, visit their documentation to explore the data schema. Go through default objects to see what you can do out of the box. Learn data types and how you can act upon them with built-in mechanisms like Hubspot calculations, Salesforce Calculated Fields with Formulas, or how to present them with template languages like Liquid.

As an exercise, you can also visit Fabric’s article on building an ecommerce data schema to explore how a data integrity theory from this post applies to a real-life use case.

Share it on Twitter
Share it on Facebook
Share it on LinkedIn

Are you wasting time and money on digital promotions?

It’s time for a change.