Skip to main content

A Developer's Guide to nuvo's Cleaning Functions

Cleaning Functions

Data onboarding can be time-consuming and error-prone, especially when dealing with data in various formats and structures. The nuvo Data Importer is designed to streamline this process, providing a robust solution for automating data validation and cleaning.

This guide explores how developers can leverage nuvo's pre-built Cleaning Functions to enhance data import workflows, reduce manual labor, and ensure data quality.

Benefits of using nuvo's Cleaning Functions

nuvo's Cleaning Functions provide flexibility and robustness in handling diverse data validation and cleaning tasks. They help automate repetitive and error-prone tasks, reduce manual effort, and ensure data quality by covering a wide range of use cases. This flexibility is key to adapting to different data sources and requirements.

Here are some key benefits of using nuvo's Cleaning Functions:

  1. Automated Data Validation: nuvo's Cleaning Functions enable you to predefine rules, ensuring data integrity and consistency.
  2. Data Cleaning and Transformation: These functions can clean and transform data to meet your system's requirements, such as standardizing formats, correcting errors, and normalizing values.
  3. Error Detection and Handling: They allow you to detect errors in the input data and provide detailed error messages for debugging and resolution.
  4. Customizability and Extensibility: Developers can leverage nuvo's Cleaning Functions to address specific data cleaning needs and extend their capabilities.

Key functions for cleaning data with nuvo

Implementing Cleaning Functions makes it easy to ensure the accuracy and quality of your data imports. The nuvo Data Importer provides several key functions to help with this process: columnHooks, onEntryInit, and onEntryChange.

  1. columnHooks:
    • Purpose: columnHooks is used to define custom cleanings per column. This allows for targeted data transformations and validation.
    • When is it called: columnHooks is called after the "Match Columns"-step before the "Review Entries"-step is shown.
    • Parameters: Each cleaning function defined within columnHooks is set via the column key, and provides access to all values in the column.
  2. onEntryInit:
    • Purpose: onEntryInit is used to initialize each row after the data was mapped. This function is typically used to apply initial cleaning or transformations across columns of one row.
    • When is it called: onEntryInit is executed once for each row after the columnHooks callback, so after the "Match Columns"-step and before the "Review Entries"-step, but not when a row is created manually in the "Review Entries"-step.
    • Parameters: It receives the row object and the row index. This allows you to access and modify each row's data and handle initial validations or transformations.
  3. onEntryChange:
    • Purpose: onEntryChange is designed to handle changes to entries during the "Review Entries"-step. It ensures that any modifications to the data are validated and cleaned appropriately.
    • When is it called: onEntryChange is triggered whenever an entry is changed (e.g., through manual input, copy-paste, drag-and-drop, duplication, find&replace or ai cleanings). It is executed per entry change but only once for bulk changes.
    • Parameters: It takes an array of rows that have changed, with each row object containing the new data, the change log (old values), and the action type (edit, create, delete, replace, cleaningAssistant). This provides detailed information about each change, allowing for precise validation and cleaning.

These functions form a core part of nuvo's data cleaning capabilities, offering flexibility and control over data processing workflows. Once a user signs up and sets up the importer, implementing these Cleaning Functions ensures that data is validated, transformed, and enriched according to specific requirements. Learn more in the Cleaning Functions documentation.

Prerequisites

Before diving into nuvo's Cleaning Functions, ensure you have a basic understanding of the following tools and concepts:

  • A good understanding of TypeScript and React.
  • Familiarity with cloud services and APIs (e.g., Google Maps API for address validation).
  • nuvo account and a license key.
  • A CodeSandbox account for running sample code.

Installation

To implement nuvo's Cleaning Functions, follow these steps:

  1. Get your License Key

Licence key

  1. Install the nuvo Data Importer: Add the nuvo package to your project.
    npm install nuvo-react

OR if you use yarn

    yarn add nuvo-react
  1. Configure the nuvo Data Importer: Initialize and configure the importer in your project. Read the First Steps guide for detailed instructions. If you don't want to use React, the nuvo kowledge base contains a wealth of information on how to get started with your preferrred framework.

knowledge base

Getting started with nuvo's Cleaning Functions

This guide will focus on using nuvo's Cleaning Functions in a React application. If you don't have an existing project, you can fork this CodeSandbox starter file. It already has the UI setup and the dependencies installed.

Sample Use Cases

Let's explore several use cases and how to implement them using nuvo's Cleaning Functions:

Prevent importing duplicates (with server call)

Sometimes, when importing data into a database, it's essential to ensure no duplicate entries are added. This means, you need to check the incoming data against what's already in your database before importing it. To handle this kind of check, you can use nuvo Cleaning Functions.

Let's go ahead and use columnHooks to scan a single column for duplicates against existing database entries. For this specific use case, we have mocked a database to showcase how the procedure would work:

columnHooks={{
email: async (values) => {
let registeredEmails: string[];

await fetch("https://my-json-server.typicode.com/comdocks/nuvo/customers")
.then((response) => response.json())
.then((json) => {
registeredEmails = json.map((row: any) => row.email);
});

const duplicateErrors: any[] = [];

values.forEach((entry) => {
if (registeredEmails.includes(entry[0]?.toString() || "")) {
duplicateErrors.push([
{
info: [
{
message: "Duplicate entry. The email address already exists.",
level: "error",
},
],
},
entry[1], // don't forget to return the rowIndex!
]);
}
});

return duplicateErrors;
},
}}

Here's a quick breakdown of the columnHooks function that scans the email column to check for duplicates in the database:

  • The email column is specified directly as a parameter of the columnHooks function, making it the target column for this validation process.
  • An API call retrieves the existing email addresses from our database and stores them in the registeredEmails variable.
  • The function then iterates over the email column in the input data, checking each entry to see if it's already in registeredEmails.
    • If an entry matches an email in registeredEmails (i.e., it already exists in the database), an information object is added to the info property of that specific row, marking it as a duplicate.
    • In this structure, entry[0] represents the column's value, while entry[1] stores the rowIndex for pinpointing the duplicate entry.

Add this columnHooks function in the NuvoImporter component:

<NuvoImporter
licenseKey={"YOUR_LICENSE_KEY_HERE"}
settings={{
developerMode: true,
identifier: "test_data",
columns: [
{
key: "name",
label: "name",
columnType: "string",
},
{
key: "email",
label: "Email",
columnType: "email",
},
],
}}
columnHooks={{
email: async (values) => {
let registeredEmails: string[];
await fetch("https://my-json-server.typicode.com/comdocks/nuvo/customers")
.then((response) => response.json())
.then((json) => {
registeredEmails = json.map((row: any) => row.email);
});

const duplicateErrors: any[] = [];

values.forEach((entry) => {
if (registeredEmails.includes(entry[0]?.toString() || "")) {
console.log(entry);
duplicateErrors.push([
{
info: [
{
message: "Duplicate entry. The email address already exists.",
level: "error",
},
],
},
entry[1],
]);
}
});

return duplicateErrors;
},
}}
onResults={(res, errors, complete) => {
setResult(res);
complete();
}}
/>

Check out the complete code in this CodeSandbox. You can test the application using the data available for download within this sandbox.

Validate with cross-column regex patterns

Another common use case is adding details to one column based on information in another column of the same row. For example, we need to verify a specific code (like a VAT code) based on the country provided in another column. For this validation, we would need to access both the country column and the code column.

In this scenario, we need a function that can look at multiple columns within the same row simultaneously. While columnHooks is great for working with a single column, we'll use onEntryInit to access the data from multiple columns in the same row within one function.

First, let's say we have our validation schema stored in a variable, like in countryMapping shown below:

const countryMapping: any = {
Germany: {
regex: /^DE[0-9]{9}$/,
},
Nigeria: {
regex: /^[0-9]{8}-[0-9]{4}$/,
},
Ecuador: {
regex: /^[0-9]{13}$/,
},
Canada: {
regex: /^[A-Z]{9}$/,
},
};

We write an onEntryInit function that uses these countryMapping regex patterns to validate the code column. This validation will be based on the country specified in the country column.

onEntryInit={(row, rowIndex) => {
if (countryMapping.hasOwnProperty(row.country)) {
return {
country_code: {
value: countryMapping[row.country?.toString() || ""].code,
info: [
{
message: "This value was automatically generated.",
level: "info",
},
],
},
};
}
}}

Here's a quick breakdown of the onEntryInit function that uses countryMapping to validate and potentially auto-generate values for the code column:

  • The function verifies if the country value in the current row has a corresponding entry in the countryMapping object.
  • If a match is found, the function retrieves the appropriate regex pattern for that country from countryMapping.
  • The regex is used to validate the code column based on the specified format for that country.
  • The code column is then populated with an info object that includes a message, indicating that the value was "automatically generated."

Add this onEntryInit function in the NuvoImporter component:

const countryMapping: any = {
Germany: {
regex: /^DE[0-9]{9}$/,
},
Nigeria: {
regex: /^[0-9]{8}-[0-9]{4}$/,
},
Ecuador: {
regex: /^[0-9]{13}$/,
},
Canada: {
regex: /^[A-Z]{9}$/,
},
};

<NuvoImporter
licenseKey={"YOUR_LICENSE_KEY_HERE"}
settings={{
developerMode: true,
identifier: "contact_data",
allowManualInput: true,
columns: [
{
label: "Country",
key: "country",
columnType: "category",
dropdownOptions: [
{
label: "Germany",
value: "Germany",
type: "string",
},
{
label: "Canada",
value: "Canada",
type: "string",
},
{
label: "Nigeria",
value: "Nigeria",
type: "string",
},
{
label: "Ecuador",
value: "Ecuador",
type: "string",
},
],
},
{
key: "code",
label: "Code",
columnType: "string",
},
],
}}
onEntryInit={(row, rowIndex) => {
if ((row.country === "Germany" && !countryMapping["Germany"].regex.test(row.code)) || (row.country === "Nigeria" && !countryMapping["Nigeria"].regex.test(row.code)) || (row.country === "Ecuador" && !countryMapping["Ecuador"].regex.test(row.code)) || (row.country === "Canada" && !countryMapping["Canada"].regex.test(row.code))) {
return {
code: {
value: row.code,
info: [
{
message: "This value must correspond to the VAT format of the selected country.",
level: "error",
},
],
},
};
}
}}
onResults={(res, errors, complete) => {
setResult(res);
complete();
}}
/>;

To make sure this validation runs not only when data is first loaded but also after every manual change in the "Review Entries"-step, we'll add another cleaning function: onEntryChange.

Here's how it works: onEntryInit is called right before the user gets to the "Review Entries"-step, applying initial validations. But with onEntryChange, we can re-run the same validation every time a user makes a change. This keeps everything consistent and as required with each edit.

Just keep in mind that the structure of onEntryChange is a bit different from onEntryInit. If you want to dive into the specifics, check out the Cleaning Functions documentation for all the details.

Here's a code snippet of the onEntryChange function applying the same validation:

onEntryChange={(rows) => {
return rows.map((row) => {
if (
(row.data.country === "Germany" && !countryMapping["Germany"].regex.test(row.data.code)) ||
(row.data.country === "Nigeria" && !countryMapping["Nigeria"].regex.test(row.data.code)) ||
(row.data.country === "Ecuador" && !countryMapping["Ecuador"].regex.test(row.data.code)) ||
(row.data.country === "Canada" && !countryMapping["Canada"].regex.test(row.data.code))
) {
return {
rowIndex: row.rowIndex,
data: {
code: {
value: row.data.code,
info: [
{
message: "This value must correspond to the VAT format of the selected country.",
level: "error",
},
],
},
},
};
}
return row;
});
}}

Every time the user modifies either the country or code column during the "Review Entries"-step, onEntryChange runs to validate the code field according to the country-specific VAT format.

Add this onEntryChange function to the NuvoImporter component:

const countryMapping: any = {
Germany: {
regex: /^DE[0-9]{9}$/,
},
Nigeria: {
regex: /^[0-9]{8}-[0-9]{4}$/,
},
Ecuador: {
regex: /^[0-9]{13}$/,
},
Canada: {
regex: /^[A-Z]{9}$/,
},
};

const validateCountryCode = (country: string, code: string) => {
if (!country || !countryMapping[country]) return false;
return countryMapping[country].regex.test(code);
};

<NuvoImporter
licenseKey={"YOUR_LICENSE_KEY_HERE"}
settings={{
developerMode: true,
identifier: "contact_data",
allowManualInput: true,
columns: [
{
label: "Country",
key: "country",
columnType: "category",
dropdownOptions: [
{
label: "Germany",
value: "Germany",
type: "string",
},
{
label: "Canada",
value: "Canada",
type: "string",
},
{
label: "Nigeria",
value: "Nigeria",
type: "string",
},
{
label: "Ecuador",
value: "Ecuador",
type: "string",
},
],
},
{
key: "code",
label: "Code",
columnType: "string",
},
],
}}
onEntryInit={(row, rowIndex) => {
if (!validateCountryCode(row.data.country, row.data.code)) {
return {
code: {
value: row.code,
info: [
{
message: "This value must correspond to the VAT format of the selected country.",
level: "error",
},
],
},
};
}
}}
onEntryChange={(rows) => {
return rows.map((row) => {
if (!validateCountryCode(row.data.country, row.data.code)) {
return {
rowIndex: row.rowIndex,
data: {
code: {
value: row.data.code,
info: [
{
message: "This value must correspond to the VAT format of the selected country.",
level: "error",
},
],
},
},
};
}
return row;
});
}}
onResults={(res, errors, complete) => {
setResult(res);
complete();
}}
/>;

Check out the complete code in this CodeSandbox. You can test the application using the data available for download within this sandbox.

Automate cross-column enrichment

Automating the enrichment of one column using data from another column can be incredibly useful. For instance, we can enhance the code column by pulling information from a separate column containing the country's name.

In this example, we demonstrate how to retrieve a predefined country code from a predefined list. However, it’s also possible to fetch this data from a database or even a third-party API.

We have our country codes stored in a variable, countryMapping:

const countryMapping: any = {
Germany: {
code: "DE",
},
Nigeria: {
code: "NG",
},
Ecuador: {
code: "EC",
},
Canada: {
code: "CA",
},
};

Let’s define an onEntryInit function that loops through each row of the mapped data, checks the country column for the name of the country, and then returns the corresponding code for it:

onEntryInit={(row, rowIndex) => {
if (countryMapping.hasOwnProperty(row.country)) {
return {
country_code: {
value: countryMapping[row.country?.toString() || ""].code,
info: [
{
message: "This value was automatically generated.",
level: "info",
},
],
},
};
}
}}

Here's a quick breakdown of how the onEntryInit function works to automatically populate the code column based on the values in the country column:

  • When a row is initialized, onEntryInit checks if the value in country matches an entry in the countryMapping object.

  • If there’s a match, onEntryInit retrieves the corresponding country code from countryMapping and adds it to the code column.

  • Along with the country code, an info message is added, stating that this value was "automatically generated" to clarify that the code was added based on the values in the country column.

We’ll also define an onEntryChange function that applies the same procedure. This ensures that if the user updates a value in the country column the code column is updated as well:

onEntryChange={(rows) => {
if (countryMapping.hasOwnProperty(rows[0].data.country)) {
return rows.map((row) => {
return {
rowIndex: row.rowIndex,
data: {
country_code: {
value: countryMapping[row.data.country?.toString() || ""].code,
info: [
{
message: "This value was automatically generated.",
level: "info",
},
],
},
},
};
});
}
return [];
}}

Here's a quick breakdown of how the onEntryChange function works to keep the code column updated based on changes in the country column:

  • Each time a user edits the country column during the "Review Entries"-step, onEntryChange is triggered to re-validate and update the code column.

  • It checks if the new value in the country column has a corresponding entry in the countryMapping object.

  • If there’s a match, onEntryChange retrieves the corresponding country code from countryMapping and adds this code to the code column.

  • Just like in onEntryInit, an info message is added, noting that the value was "automatically generated," providing context of the update.

Add these Cleaning Functions to the NuvoImporter component:

const countryMapping: any = {
Germany: {
code: "DE",
},
Nigeria: {
code: "NG",
},
Ecuador: {
code: "EC",
},
Canada: {
code: "CA",
},
}

<NuvoImporter
licenseKey={"YOUR_LICENSE_KEY_HERE"}
settings={{
developerMode: true,
identifier: "contact_data",
allowManualInput: true,
columns: [
{
label: "Country",
key: "country",
columnType: "category",
dropdownOptions: [
{
label: "Germany",
value: "Germany",
type: "string",
},
{
label: "Canada",
value: "Canada",
type: "string",
},
{
label: "Nigeria",
value: "Nigeria",
type: "string",
},
{
label: "Ecuador",
value: "Ecuador",
type: "string",
},
],
},
{
key: "country_code",
label: "Country Code",
columnType: "string",
},
],
}}
onEntryInit={(row, rowIndex) => {
if (countryMapping.hasOwnProperty(row.country)) {
return {
country_code: {
value: countryMapping[row.country?.toString() || ""].code,
info: [
{
message: "This value was automatically generated.",
level: "info",
},
],
},
};
}
}}
onEntryChange={(rows) => {
if (countryMapping.hasOwnProperty(rows[0].data.country)) {
return rows.map((row) => {
return {
rowIndex: row.rowIndex,
data: {
country_code: {
value: countryMapping[row.data.country?.toString() || ""].code,
info: [
{
message: "This value was automatically generated.",
level: "info",
},
],
},
},
};
});
}
return [];
}}
onResults={(res, errors, complete) => {
setResult(res);
complete();
}}
/>

Check out the complete code in this CodeSandbox. You can test the application using the data available for download within this sandbox.

Address Validation using the Google Maps API

Let’s say you have a file containing multiple address entries, and you want to ensure each address is valid before importing it into your database. For this, you can use tools like the Google Maps API to verify addresses. With nuvo’s Cleaning Functions you can easily automate this validation, ensuring that every address is checked and verified before importing.

Here’s an example of how to do this using columnHooks:

columnHooks={{
address: async (values) => {
const transformedValues = await Promise.all(
values.map(async ([value, index]) => {
const currentAddress = `${value}`.replace(/\n/g, "");
const geocoder = new (window as any).google.maps.Geocoder();

const isValidAddress = await new Promise((resolve) =>
geocoder.geocode(
{ address: currentAddress },
(_results: any, status: any) => {
if (status === (window as any).maps.GeocoderStatus.OK) {
return resolve(true);
} else {
return resolve(false);
}
}
)
);

return [
{
value: value,
info: isValidAddress
? []
: [
{
level: "error",
message: "Invalid address",
},
],
},
index,
];
})
);

return Object.fromEntries(transformedValues);
},
}}

Here’s a quick breakdown of how the columnHooks function works to validate addresses using the Google Maps API:

  • The function sets up an asynchronous call for the address column, validating each address imported from the file.

  • Each address is cleaned by removing any newline characters to improve geocoding accuracy.

  • The function then geocodes the cleaned address using geocoder.geocode.

    • If geocoding is successful (status === google.maps.GeocoderStatus.OK), it considers the address valid.
    • If geocoding fails, it marks the address as invalid.
  • If there are no issues, it returns the original address with an empty info array.

    • If there are any issues, it adds an entry to the info array describing the error.

Add this columnHooks function to the NuvoImporter component:

<NuvoImporter
licenseKey={"YOUR_LICENSE_KEY_HERE"}
settings={{
developerMode: true,
identifier: "contact_data",
allowManualInput: true,
columns: [
{
key: "address",
label: "Address",
columnType: "string",
description: "The complete address.",
},
],
}}
columnHooks={{
address: async (values) => {
const transformedValues = await Promise.all(
values.map(async ([value, index]) => {
const currentAddress = `${value}`.replace(/\n/g, "");
try {
const geocoder = new (window as any).google.maps.Geocoder();
const isValidAddress = await new Promise((resolve, reject) =>
geocoder.geocode({ address: currentAddress }, (results: any, status: any) => {
if (status === (window as any).maps.GeocoderStatus.OK) {
return resolve(true);
} else if (status === (window as any).maps.GeocoderStatus.OVER_QUERY_LIMIT) {
return reject(new Error("API quota exceeded"));
} else {
return resolve(false);
}
}),
);
return [
{
value: value,
info: isValidAddress
? []
: [
{
level: "error",
message: "Invalid address",
},
],
},
index,
];
} catch (error) {
return [
{
value: value,
info: [
{
level: "error",
message: `Address validation failed: ${error.message}`,
},
],
},
index,
];
}
}),
);
return Object.fromEntries(transformedValues);
},
}}
onResults={(res, errors, complete) => {
setResult(res);
complete();
}}
/>

Check out the complete code in this CodeSandbox. You can test the application using the data available for download within this sandbox.

Note: This is a template of how such an implementation would be structured. You would need to insert your Google Maps API Key inside a Google script tag in index.html to ensure the sandbox functions correctly.

Make columns uneditable

In some cases, you need to lock certain values in the "Review Entries"-step to prevent user edits. This is typically done if these values are either calculated or pulled from an external server. To enforce this, we can use the onEntryChange function to disregard any changes the user attempts to make.

Let’s set up the onEntryChange function:

onEntryChange={(rows) => {
let initialValueArray: Array<any> = [];

rows.forEach((row) => {
if (row.changeLog) {
initialValueArray.push({
rowIndex: row.rowIndex,
data: row.changeLog,
});
}
});

initialValueArray.forEach((initialValueObj) => {
const rowIndex = initialValueObj.rowIndex;
const rowToUpdate = rows.find((row) => row.rowIndex === rowIndex);

if (rowToUpdate) {
Object.keys(initialValueObj.data).forEach((key) => {
const oldValue: any = {
value: initialValueObj.data[key],
info: [
{ level: "info", message: "This cell is uneditable" },
],
};
rowToUpdate.data[key] = oldValue;
});
}
});

return rows;
}}

Here's a quick breakdown of how the onEntryChange function works to prevent specific values being changed by the user:

  • The function creates an array (initialValueArray) to store the original values of all entries the user tries to change. It checks each row's changeLog property, which holds the initial values before any changes, and saves these in initialValueArray.

  • Next, the function loops through each entry in initialValueArray and finds the corresponding row in the rows array.

    • For each changed entry in the row, it reverts the value to its original value in changeLog.
    • It also adds an info message to each reverted entry, stating "This cell is uneditable".
  • Finally, the function returns the rows array, where any user changes to restricted entries have been reverted to their original values, ensuring that calculated or externally sourced data remains valid.

Add this onEntryChange function to the NuvoImporter component:

<NuvoImporter
licenseKey={"YOUR_LICENSE_KEY_HERE"}
settings={{
developerMode: true,
identifier: "contact_data",
columns: [
{
key: "name",
label: "Name",
columnType: "string",
description: "Full name.",
},
{
key: "email",
label: "Email",
columnType: "string",
description: "Valid email address.",
},
{
key: "phone",
label: "Phone",
columnType: "string",
description: "Contact phone number.",
},
{
key: "date",
label: "Date",
columnType: "string",
description: "Relevant date (YYYY-MM-DD).",
},
],
}}
onEntryChange={(rows) => {
let initialValueArray: Array<any> = [];

rows.forEach((row) => {
if (row.changeLog) {
initialValueArray.push({
rowIndex: row.rowIndex,
data: row.changeLog,
});
}
});

initialValueArray.forEach((initialValueObj) => {
const rowIndex = initialValueObj.rowIndex;
const rowToUpdate = rows.find((row) => row.rowIndex === rowIndex);

if (rowToUpdate) {
Object.keys(initialValueObj.data).forEach((key) => {
const oldValue: any = {
value: initialValueObj.data[key],
info: [{ level: "info", message: "This cell is uneditable" }],
};
rowToUpdate.data[key] = oldValue;
});
}
});

return rows;
}}
onResults={(res, errors, complete) => {
setResult(res);
complete();
}}
/>

Check out the complete code in this CodeSandbox. You can test the application using the data available for download within this sandbox.

Conclusion

In this guide we have explored various use cases for using nuvo's Cleaning Functions in the nuvo Data Importer. We have implemented different cleanings such as a duplicate validation, a cross-column regex validation, a backend validation, and an address validation using the Google Maps API.

These use cases demonstrate the flexibility and power of nuvo in handling complex data processing tasks. You can further experiment with other examples for using nuvo's Cleaning Functions in our knowledge base.

Next Steps

nuvo offers a wide array of tools to help you clean and process your data effectively, including advanced Cleaning Functions and data transformation techniques to enhance your data quality and consistency. These tools can assist you in refining your data and ensuring its accuracy and coherence, ultimately improving the effectiveness of your data-driven processes.

Resources