How to offload arrays in DataHub

Christoph_Souris · February 16, 2024, 9:44am

Product/components used and version/fix level:

DataHub, version 10.16

Detailed explanation of the problem:

So, I do have a fragment within our digital twin looking a bit like this:

"additional": [
  {
    "type": "A23",
    "number": "82746t1771"
  },
  {
    "type": "B12",
    "number": "1239656691"
  },
  {
    "type": "A08",
    "number": "123966919"
  }
]

With a varying number of array entries.

I want to offload this data to our data lake “somehow” but I don’t know how to deal with the array. Is it possible to collect array entries into two columns like:
additional_type and additional_number (each then holding all array entries as comma separated string)? Or maybe something like additional_type_1 and additional_number_1 etc. for individual values, accepting a maximum number of array entries?

Thank you and kind regards, Christoph

Michael_Cammert · February 20, 2024, 3:51pm

Hi!

The problem here is that with respect to additional columns, you have to configure them - this currently implies you know their number at design time of your offloading.

So, you can either just offload the whole arrays as one column and then take care of accessing the details when querying the data lake (the function flatten can be very handy for that). Or you can offload a fixed number of positions by accessing the array additional[0].“type”

Regards,
Michael

Christoph_Souris · February 21, 2024, 7:43am

Thanks Michael,

yes, unpacking the nested fields into individual arrays within the inventory collection would have been my preferred choice but this cannot be achieved directly. Together with Alexander Friedrich from Software AG we worked out these 3 options:

Option 1: Known number of array entries

Just as you stated - if the (max) number of array elements is known we can simply export specific indexes into named columns e.g. src.additional[0].type → additional_type_0 and so on.

This is definitely the easiest way to do this.

Option 2: Flatten entire array

It is possible to “flatten” the array during offloading which will produce multiple rows in the offloaded table. The target column type will be of type STRUCT. For example: If FLATTEN(src.additional) is offloaded to an additional column, I can later access individual values within this column like src.additional.type.

This has two disadvantages: A) multiple rows are being created which creates a bit of redundancy and B) all fields from the array are being offloaded. It is not possible to just select specific nested fields and they cannot be offloaded to individual columns either.

Option 3: Export the array as ARRAY/LIST and deal with it later

If the array fragment is offloaded as-is it will be a single column of type LIST. This list and the elements within (of type STRUCT) can be access it queries, though. We can use this to create a Virtual Data Set (VDS) which holds just specific nested fields of the array in a nice, normalized way.

In this (phony) example I would create an inventory_additional VDS which is based on the offloaded inventory collection like so:

SELECT o.id, o.lastUpdated, o.add.type, o.add.number
FROM ( 
   SELECT src.id, src.lastUpdated, FLATTEN(src.additional) as add 
   FROM inventory as src
) as o

Because the array/list is _flatten_ed, this will result in multiple rows in the inventory_additional view. This view can then be used in all kinds of queries.

I chose option 3 in the end, as in my actual data the array has many more nested fields (and I only needed a couple) and I could not assume a max number of array elements. The flattened VDS is pretty nice, getting the whole thing closer to a normalized database design.

Cheers! Christoph

system · August 19, 2024, 7:44am

This topic was automatically closed 180 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Offload custom event properties via DataHub Forum datahub	3	296	May 20, 2024
How to migrate data from external database to Cumulocity Forum	6	285	September 12, 2023
DataHub offload time configuration Forum	8	422	September 7, 2024
Getting Validation Error while configuring datahub offloading Job (Inventory) Forum cumulocity-edge , analytics , Streaming-Analytics-Apama , device-connectivity	8	283	November 18, 2024
Is it possible to export/import custom smart rest templates via rest/http Forum	4	493	March 15, 2023

How to offload arrays in DataHub

Product/components used and version/fix level:

Detailed explanation of the problem:

Related topics