What does it take to talk to your Industrial Data in the same way we talk to ChatGPT?

  • Generative AI
  • Data Contextualization
  • Digital Twin
  • Operational workflows

Published: April 18, 2023

Team Cognite

Retraining LLMs to understand the industrial domain as well as they understand the public internet

Teaching the new dog new tricks

The excitement and innovation surrounding Generative AI and Large Language Model (LLM) solutions like ChatGPT drive expectations of an ‘iPhone moment’ for industrial digital transformation. These LLMs are the result of training a machine learning model on a large corpus of text data to generate and understand natural language. This unprecedented leap forward in Natural Language Processing allows Generative AI systems to consume, understand, and provide insights into accessible content.

By virtue of their pre-training to perform NLP tasks, LLMs have a vast knowledge base to draw from. However, the content in an LLM’s data stores may be dated (e.g., pre-September 2021) and based solely on content from the public domain. This can limit the source data available for generating a response and potentially lead to out-of-date information or ‘creative’ answers that make up for the information gap (hallucinations). If we can ‘train’ an LLM like ChatGPT on curated, contextualized industrial data, then we can talk to this data as easily as we converse with ChatGPT and have confidence in the basis of the response.

Context matters

The vast data sets used to train LLMs are curated in various ways to provide clean, contextualized input. Contextualized data includes explicit semantic relationships that can greatly affect the quality of the model’s output.

Contextualizing the data we provide as input to an LLM ensures that the text consumed is relevant to the task at hand. For example, when prompting an LLM to provide information about operating industrial assets, the data provided to the LLM should include not only the data and documents related to those assets but also the explicit and implicit semantic relationships across different data types and sources.
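To make the idea concrete, here is a minimal sketch of explicit semantic relationships expressed as subject-predicate-object triples; the asset names and relationship types are invented for illustration:

```python
# A tiny, hypothetical slice of a contextualized knowledge graph, expressed
# as (subject, predicate, object) triples. All names here are illustrative.
triples = [
    ("pump_101", "hasTimeSeries", "pump_101_discharge_pressure"),
    ("pump_101", "hasWorkOrder", "WO-2023-0042"),
    ("pump_101", "appearsIn", "pid_drawing_A-17"),
    ("pump_101", "feeds", "separator_201"),
]

# Because the relationships are explicit, a query can walk from an asset to
# every related data type instead of guessing joins across siloed tables.
related = [(predicate, obj) for subj, predicate, obj in triples if subj == "pump_101"]
print(related)
```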

In the process of building the contextualized industrial knowledge graph, data is also processed to improve quality through normalization, scaling, and augmentation for calculated or aggregated attributes. For Generative AI, the old adage of Garbage-In → Garbage-Out applies. Aggregations of industrial data in large data warehouses and data lakes that have not been contextualized or pre-processed lack the semantic relationships needed to ‘understand’ the data and lack the data quality necessary for LLMs to provide trustworthy, deterministic responses.

How do we ‘teach’ an LLM industrial data?

An LLM is trained by parceling text data into smaller collections, or chunks, that can be converted into embeddings. An embedding is simply a sophisticated numerical representation of the ‘chunk’ of text that takes into consideration the context of surrounding or related information. This makes it possible to perform mathematical calculations to compare similarities, differences, and patterns between different ‘chunks’ to infer relationships and meaning. These mechanisms enable an LLM to learn a language and understand new data that it has not seen previously.
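As a rough sketch of that mechanism, the snippet below chunks a few sentences, converts them into embeddings with OpenAI’s embeddings API, and compares them with cosine similarity. The model name and the sentence-level chunking are illustrative choices, not a description of how any particular LLM was trained:

```python
import numpy as np
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Parcel text into chunks. Production pipelines split on tokens with overlap;
# splitting on sentences keeps this sketch short.
text = (
    "Pump 101 feeds separator 201. "
    "Work order WO-2023-0042 reports a seal leak on pump 101. "
    "The cafeteria menu changes every Friday."
)
chunks = [c.strip(" .") + "." for c in text.split(". ")]

# Convert each chunk into an embedding: a numerical representation of the text.
resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
vectors = np.array([d.embedding for d in resp.data])

# Cosine similarity lets us compare chunks mathematically.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[0], vectors[1]))  # two pump chunks: relatively similar
print(cosine(vectors[0], vectors[2]))  # pump vs. cafeteria: less similar
```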

When we ask an LLM a question (prompt) and provide it with additional information to consider when responding (inputs), it processes the prompt by encoding it into these numerical representations using the same techniques used during training. This numerical representation of the prompt is then mathematically compared to the stored embeddings it already ‘knows’ along with encoded embeddings for any additional content provided with the prompt (inputs). The LLM will retrieve embeddings (‘chunks’) deemed relevant and then use them as sources to generate a response.
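Continuing the sketch above (it reuses client, chunks, vectors, and cosine from the previous snippet), the retrieve-then-generate step can be approximated like this; the chat model name is again an illustrative choice:

```python
# Encode the prompt with the same embedding model used for the chunks.
question = "What maintenance issue was reported on pump 101?"
q_vec = np.array(
    client.embeddings.create(
        model="text-embedding-3-small", input=[question]
    ).data[0].embedding
)

# Compare the prompt's embedding to the stored ones and keep the best matches.
ranked = sorted(range(len(chunks)), key=lambda i: cosine(q_vec, vectors[i]), reverse=True)
context = "\n".join(chunks[i] for i in ranked[:2])

# The retrieved chunks become the sources the LLM uses to generate a response.
answer = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system", "content": f"Answer using only these sources:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(answer.choices[0].message.content)
```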

Option 1 - The CoPilot approach

An Open Industrial Digital Twin comes to life with a comprehensive, contextualized knowledge graph of all industrial data related to an asset. By capturing both the asset data and the semantic relationships between the various asset data types and sources, it becomes possible to build a robust API library that can programmatically navigate and interrogate the asset’s Digital Twin.

Because LLMs like ChatGPT understand and can generate sophisticated code in multiple languages (e.g., Python and JavaScript), we can prompt the LLM with a question about our industrial data, and it can interpret the question, write the relevant code using Cognite Data Fusion’s APIs, and execute that code to return a response to the user.
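A minimal sketch of that pattern is shown below: the request bundles the user’s question with API documentation, the LLM writes the code, and the application reviews it before execution. The documentation string paraphrases the Cognite Python SDK purely for illustration, and execution is deliberately left as a manual step:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Excerpt of API documentation handed to the model as grounding. These lines
# paraphrase the Cognite Python SDK for illustration only.
api_docs = """
CogniteClient().assets.list(name=...)              -> assets matching a name
CogniteClient().time_series.list(asset_ids=[...])  -> time series for assets
"""

question = "Which time series are attached to the asset named '23-PT-92531'?"

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {
            "role": "system",
            "content": "Write Python that answers the user's question using "
                       "only the documented API calls below.\n" + api_docs,
        },
        {"role": "user", "content": question},
    ],
)

generated_code = completion.choices[0].message.content
print(generated_code)  # review before running; never exec() model output blindly
```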

These CoPilot-based approaches use the LLM’s ability to understand natural language and to write code based on published API documentation and examples. This is not possible with data lakes or data warehouses: without a contextualized industrial knowledge graph, there are no API libraries that can serve as a reliable mechanism for accessing rich industrial data. Additionally, because all data access happens through the APIs, no proprietary data is shared with third parties, and the built-in mechanisms for logging and access control remain intact.

Option 2 - Provide contextualized data directly to the LLM

API libraries available from OpenAI, LangChain, and others allow us to leverage the power of the LLM’s natural language processing in conjunction with proprietary data. These libraries enable developers to take data that would normally exceed the limits of GPT’s input text (its context window) and perform the same tasks an LLM performs during training: parsing contextualized industrial data into ‘chunks’ that can be turned into embeddings and stored in a private database.

This database can include numerical representations (embeddings) of specific asset data, including time series, work orders, simulation results, P&ID diagrams, as well as the relationships defined by the digital twin knowledge graph. Using these open APIs, we can then send a prompt to the LLM along with access to our proprietary embeddings database so that the LLM will formulate its response based on the relevant content extracted from our own proprietary knowledge graphs.
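Below is a minimal sketch of this option using LangChain with a local FAISS index as the private database; library APIs shift between versions, and the asset snippets are invented for illustration:

```python
# pip install langchain-openai langchain-community faiss-cpu
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Invented snippets standing in for contextualized asset data: work orders,
# time-series summaries, and knowledge-graph relationships flattened to text.
snippets = [
    "Work order WO-2023-0042: seal leak detected on pump 101.",
    "Time series pump_101_discharge_pressure trended 8% below normal in March.",
    "Knowledge graph: pump 101 feeds separator 201 (P&ID drawing A-17).",
]

# Embed the snippets (the embedding calls go to OpenAI's API) and keep the
# resulting vector index in our own store rather than inside the LLM.
db = FAISS.from_texts(snippets, OpenAIEmbeddings())

# At prompt time, only the most relevant retrieved chunks are handed to the
# LLM as sources, exactly as in the retrieval sketch above.
for doc in db.similarity_search("Why is pump 101 underperforming?", k=2):
    print(doc.page_content)
```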

What does it take to talk to Industrial Data through Generative AI?

An Open Industrial Digital Twin is a prerequisite to enable Generative AI to understand and talk to your industrial data. The data models and contextualized relationships that drive the schema of a digital twin make it possible to provide open, API-driven access to industrial data that LLMs like ChatGPT can use to write and execute software in response to a prompt automagically.

Additionally, contextualized industrial data delivers not only the raw data related to our assets but also the relationships to additional data sources that enable and drive more deterministic interrogations of industrial data by LLM models.

With Generative AI powered by an Open Industrial Digital Twin, we can finally deliver next-generation Asset Performance Management solutions with cross-data-source insights. To be the first to learn more, sign up for our newsletter and stay up to date on this fast-evolving, transformative topic.
