Get more in-depth insights about problems by combining data.
The first requirement is that you have datasets with a common identifier. This is because you need to join them together with this identifier.
Examples of common identifiers:
- Date (day, quarter, hour etc.)
- User, company or another entity identifier
- any other common dimension such as plan type, gender etc.
Make sure your identifier is consistent and reliable across all the datasets being joined, so run some tests to see if the identifier is valid. The tests will be a little different depending on your identifier, but some examples would be:
- Dates: are all dates represented in your data (if your data is daily for the past 2 years, are all the days present)?
- User, company, another entity identifier or any other attribute: Does each id/attribute only have one row of data, or are there multiple rows?
- Consistency: Are every dimension and metric present in each combination, or are some dimensions missing a metric or attribute?
- Aggregations: Is the data already aggregated and if so, for what period? Do you have the raw numbers to work with if you need to recalculate them?
Many datasets will be able to be fused on different dimensions, such as having a user_id and a timestamp in multiple datasets from different systems. Determining what value you want from the data will decide on how you need to blend the data together. To do that ask yourself the following question: What is the main outcome I want to achieve from the data I blend?
- I want to understand if my number of orders are dropping due to UX problems on the website from our last release.
- I want to decide on which feature to prioritize in product development based on the number of users interacting with it over the last 2 years.
- I want to inform my marketing manager about which types of campaigns led to higher churn in our product over the last 12 months.
You may have identified, in the previous steps, that for your outcome you will need to blend an attribute where not all values are present in all your data. However, even if you have all of your data for your attributes, it is best practice to still create a master that contains all the values separately, on which you will then join the rest of your data to.
For each identifier this process will be different, but the general process will be:
- Create a new tab in your spreadsheet workbook and name it Master.
- Create all the permutations of your attribute as rows in your first column.
- Dates: if you need all days of each year, write 2-3 days out and then drag it down until you have enough days for your date range.
- Other attributes such as plan type: write out each one manually or export it from a system (if available to you).
- User and other entity identifiers: usually it will not be possible to write these out manually; therefore you would normally copy + paste from all the individual datasets to obtain the full set for your particular use case.
- Do not add any other columns or metrics to this sheet (yet).
Take each dataset and ensure that the identifier you will be using is in the first column A and import it into one tab of your spreadsheet workbook. You will end up with each dataset on a separate tab and make sure to name them consistently, so you remember what they are!
You can find the VLOOKUP function in any spreadsheet software. Using this function, you can reference other sheets or data and find the appropriate value. This is how to apply this function. We will assume you have 3 tabs in your workbook: Master, Backend and Web_analytics.
- In your spreadsheet workbook, open the Master tab from the previous step and start in the second column B.
- Using the VLOOKUP function, reference your column A from Master and then use it to find the same identifier in your first dataset and return the value you want to work with.
- Repeat the lookup function for each value you want to blend together, dragging down the function for all the rows in your Master sheet.
- You may have errors like
#N/Awhere it cannot find the identifier in your dataset; you will need to decide how to handle these. For example, write 0 if the lookup value is a metric.
You can now use your fully blended Master tab with all your dimensions and values from multiple datasets to help answer the problem you were looking into. Most importantly, you can easily add more datasets or different perspectives to this workbook by following the steps:
- Add another tab named after your dataset.
- Ensure all values of your identifier are within the master data already, and if not, adding them.
- Repeat the VLOOKUP function in the Master tab for each value you want to include in your new dataset.
Last edited by @hesh_fekry 2023-11-14T11:10:37Z