An Introduction to Automatic Knowledge Graph Construction
Overview
In this post, we go through a primer on knowledge graphs, a form of structured knowledge representation whose roots go back to the 1970s. We will cover their evolution from purely heuristic methods to the LLM- and AI-driven work of recent years. More importantly, this post focuses on automatic knowledge graph construction, a non-trivial task that is still actively researched.
Introduction
A knowledge graph (KG) is a form of knowledge representation originally developed for structured information and data. A notable large-scale application is web search, where Google popularized the term by launching its Knowledge Graph in 2012. Other fields have also seen promising adoption, such as medical inquiry, financial analysis, and, most recently, AI systems.
Graph and Knowledge
If you are unfamiliar with the concept of a graph, it consists of nodes and edges: nodes are connected to each other through edges, which can be either directed or undirected (a bidirectional relationship). While graphs originate from graph theory in discrete mathematics, knowledge graphs are one of the most notable real-world applications of this theory. Because the graph data structure is both flexible and well defined, using it to model real-world knowledge was almost natural.
Not only can graphs elegantly express familiar systems such as social networks or logical frameworks, but also more abstract domains with their own axioms, such as genomics (DNA) and the chemistry of materials science. Extensive research has studied how KGs can model the world and what applications follow from that. When talking about knowledge graphs, the most concrete way to think about them is through triplets.
A triplet in a KG consists of a Head, a Relation, and a Tail, where the Head and Tail are nodes and the Relation is the edge connecting them. Knowledge graphs are usually directed, to model knowledge more robustly: A -> SOMETHING -> B does not necessarily imply B -> SOMETHING -> A. More often than not, the Head, Tail, and Relation carry restrictions on their node and relation types; a KG with multiple node and relation types is usually called a heterogeneous knowledge graph, in contrast to a homogeneous one, where all nodes and edges share a single type.
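To make the triplet idea concrete, here is a minimal sketch in plain Python (the entities and relations are made up for illustration) of how directed Head -> Relation -> Tail triples can be stored and queried:

```python
from collections import namedtuple

# A triple: a directed edge from head to tail, labeled with a relation.
Triple = namedtuple("Triple", ["head", "relation", "tail"])

kg = [
    Triple("Marie Curie", "born_in", "Warsaw"),
    Triple("Marie Curie", "field", "Physics"),
    Triple("Warsaw", "capital_of", "Poland"),
]

def neighbors(kg, head):
    """Return all (relation, tail) pairs starting at a given head node."""
    return [(t.relation, t.tail) for t in kg if t.head == head]

print(neighbors(kg, "Marie Curie"))
# [('born_in', 'Warsaw'), ('field', 'Physics')]
# Direction matters: "Warsaw capital_of Poland" does not imply
# "Poland capital_of Warsaw".
```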
Components of a Knowledge Graph
Ontology
Ontology in the context of knowledge graphs refers to the structured schema that defines entity types, relationships, and constraints. It provides the semantic backbone that allows a KG to maintain logical consistency and enables reasoning over its contents. The idea is that by enforcing a well-defined ontology, knowledge can be structured in a way that is both interpretable and machine-readable. In theory, this sounds like an essential component: without it, how can a KG even function properly? However, in practice, ontologies are often sparse, fuzzy, or even disregarded entirely.
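As a rough illustration (the types and constraints below are invented for the example), an ontology can be thought of as a set of allowed node types plus a whitelist of which relations may connect which types; a KG construction step can then reject candidate triples that violate the schema:

```python
# Hypothetical mini-ontology: entity types and the relations allowed between them.
ENTITY_TYPES = {"Person", "Company", "City"}
RELATION_SCHEMA = {
    # relation: (allowed head type, allowed tail type)
    "works_at": ("Person", "Company"),
    "headquartered_in": ("Company", "City"),
    "lives_in": ("Person", "City"),
}

def is_valid(head_type, relation, tail_type):
    """Check a candidate triple against the ontology before inserting it."""
    if relation not in RELATION_SCHEMA:
        return False
    allowed_head, allowed_tail = RELATION_SCHEMA[relation]
    return head_type == allowed_head and tail_type == allowed_tail

print(is_valid("Person", "works_at", "Company"))  # True
print(is_valid("City", "works_at", "Person"))     # False: violates the schema
```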
Sparse Ontologies
While a perfectly defined ontology would be ideal, real-world KGs rarely achieve this. Designing a comprehensive schema is not just difficult; it often takes years, sometimes even decades, with domain experts continuously refining classifications, relationships, and constraints. Fields like medicine, law, and scientific research rely on meticulously crafted ontologies that evolve as new discoveries emerge. However, achieving an optimal ontology is about as feasible as mapping an entire ecosystem down to every microorganism: by the time you think you’ve captured everything, new variables emerge, forcing continuous revisions.
For beginners, working with ontologies can feel overwhelming. The fear of dismissing or misclassifying information can make the process intimidating, especially when dealing with ambiguous data. Unlike structured databases with clear categories, KGs often involve edge cases where multiple classifications could be valid, making it difficult to decide what should be enforced as “ground truth.” This hesitation can lead either to excessive caution, which slows down progress, or to overly rigid structures that fail to capture the complexity of real-world knowledge.
Fuzzy Boundaries
Even when ontologies are present, they tend to be more fluid than rigid. Many relationships and categories are context-dependent, meaning that the same entity can belong to multiple types depending on the perspective. For example, a company like Apple could simultaneously be categorized under technology, retail, and entertainment, making strict ontological classification difficult. Similarly, some relationships are not clearly binary but exist on a spectrum, further complicating the structured nature of an ontology.
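One pragmatic way to cope with this fuzziness (a small sketch, with made-up labels) is to let a node carry several type labels instead of forcing a single classification:

```python
# A node that deliberately carries multiple, equally valid type labels.
apple = {
    "name": "Apple Inc.",
    "types": {"TechnologyCompany", "Retailer", "EntertainmentProvider"},
}

# Queries then ask "does this node have type X?" rather than "what is its type?"
print("Retailer" in apple["types"])  # True
```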
More importantly, an ontology doesn’t even guarantee that a knowledge graph will function as intended. The sheer number of moving parts in KG systems makes them feel brittle: small inconsistencies or missing relationships can ripple through the entire structure, causing unexpected failures or incorrect inferences. In reality, only organizations with significant resources and manpower can afford to maintain truly robust KGs, making their accessibility feel almost like an illusion. From the outside, KGs appear to offer a clean, structured way to model knowledge, but the deeper you go, the more evident it becomes that without a dedicated team to continuously refine and update it, a KG can just as easily become a barrier to progress rather than a tool for it. This has been my own experience: rather than enabling seamless knowledge representation, working with KGs often feels like struggling against the very system that was meant to simplify things.
Disregarded in Practice
With the rise of LLMs and embedding-based KG representations, some modern knowledge graphs move away from traditional ontology-based approaches altogether. Instead of relying on predefined schemas, these methods learn relational patterns from large datasets, often using vector embeddings or probabilistic reasoning. This shift allows for greater scalability and adaptability, making it possible to construct large, dynamic KGs without the bottleneck of manual ontology curation.
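To give a sense of what “learning relational patterns as embeddings” looks like, here is a minimal TransE-style scoring sketch (the vectors are random placeholders, not trained weights): a triple is considered plausible when head + relation lands close to tail in vector space.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Placeholder embeddings; a real system would learn these from data.
entities = {name: rng.normal(size=dim) for name in ["Apple", "Cupertino", "Paris"]}
relations = {name: rng.normal(size=dim) for name in ["headquartered_in"]}

def transe_score(head, relation, tail):
    """Lower score = more plausible triple under the TransE assumption
    that head + relation ≈ tail."""
    return np.linalg.norm(entities[head] + relations[relation] - entities[tail])

print(transe_score("Apple", "headquartered_in", "Cupertino"))
print(transe_score("Apple", "headquartered_in", "Paris"))
# With untrained vectors these numbers are meaningless; the point is that
# "truth" becomes a continuous score rather than an explicit, symbolic fact.
```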
However, this approach introduces a paradox. The more heterogeneous a graph becomes, the closer it seemingly gets to representing the complexity of the real world: capturing nuance, ambiguity, and evolving relationships. Yet, at the same time, the reliance on probabilistic methods means that this “representation” is never truly ground truth but rather an approximation shaped by statistical correlations. The very techniques that enable flexible, large-scale KG construction also introduce interpretability issues, making it difficult to distinguish accurate knowledge from artifacts of the model.
In practice, this means that while modern KGs are more automated and scalable, they often sacrifice semantic reliability in the process. This raises a fundamental question: at what point does a knowledge graph stop being a structured knowledge system and start resembling just another probabilistic model?
Knowledge Graph Nodes
Nodes in a knowledge graph (KG) can take many forms, ranging from simple labels to richly attributed entities. At their core, nodes represent the entities in a KG, but what these nodes actually contain varies depending on the structure of the graph. Some nodes are strictly defined with labels and attributes, while others are more loosely connected, resembling free-text data rather than structured knowledge.
Broadly speaking, nodes in a KG can be categorized into three main types:
- Unstructured Nodes – Free-text representations that lack formal schema constraints.
- Structured Nodes – Entities with clearly defined labels and relationships.
- Semi-Structured Nodes – Entities with attributes but without strict ontology constraints.
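The distinction is easier to see in code. A rough sketch (the field names are my own choice, not a standard) of the three node flavours:

```python
from dataclasses import dataclass, field

@dataclass
class UnstructuredNode:
    # Just a blob of text; any "structure" lives in how it links to other blobs.
    text: str

@dataclass
class StructuredNode:
    # Fully typed entity: label, canonical name, and schema-conformant attributes.
    label: str                                   # e.g. "Person"
    name: str                                    # e.g. "Marie Curie"
    attributes: dict = field(default_factory=dict)

@dataclass
class SemiStructuredNode:
    # A few reliable attributes plus free text carrying the rest of the detail.
    name: str
    attributes: dict = field(default_factory=dict)
    description: str = ""
```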
Unstructured Nodes
From my experience, automatic KG construction with LLMs is risky because they can easily generate unstructured nodes, creating the illusion of structured knowledge. This is especially true with smaller local LLMs, which lack the precision of larger models and often produce loosely connected text blocks rather than well-defined entities.
While unstructured nodes can be included in a KG, they don’t necessarily make it a true KG. Instead of forming explicit, logical relationships, they often represent clusters of similar text, where connections denote relevance rather than semantic meaning. This makes them closer to retrieval systems than structured knowledge representations.
The ease of generating these nodes makes automatic KGs feel deceptively complete, but in reality, they lack the precision and consistency required for structured reasoning. While useful for search and exploration, relying too heavily on unstructured nodes risks turning a KG into just another document network rather than a system of true knowledge representation.
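A sketch of what this degenerate case looks like in practice (random vectors stand in for real sentence embeddings): nodes are text chunks and edges simply mean “these chunks are similar,” which is retrieval, not knowledge.

```python
import numpy as np

rng = np.random.default_rng(1)

chunks = [
    "Apple released a new phone this year.",
    "The new phone features a faster chip.",
    "Bananas are rich in potassium.",
]
# Stand-in for sentence embeddings from a real encoder.
vectors = [rng.normal(size=16) for _ in chunks]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Link chunks whose similarity exceeds a threshold; the edge has no semantic label.
edges = [
    (i, j, cosine(vectors[i], vectors[j]))
    for i in range(len(chunks))
    for j in range(i + 1, len(chunks))
    if cosine(vectors[i], vectors[j]) > 0.3
]
print(edges)  # "related_to" at best: no head/relation/tail semantics.
```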
Structured Nodes
Structured nodes are the most common and the most explicitly defined in a knowledge graph. They contain a clear label, a name, and well-defined relationships, making them the closest thing to what most people imagine when thinking of a KG. However, despite their structured nature, they are also the hardest to automatically construct, which is why they are typically found in human-annotated graphs, niche benchmarks, and task-specific toy datasets rather than large-scale, real-world KGs.
In theory, structured nodes should be the gold standard for KG construction: after all, they provide unambiguous entity definitions and relationships. But in practice, achieving this level of structure is often a pipe dream due to several factors:
- Manual Effort – Most structured KGs require expert annotation, which is slow, costly, and impractical for large domains.
- Domain-Specific Constraints – The rules that define valid relationships differ across fields (e.g., medicine vs. finance), making it difficult to create a one-size-fits-all automatic method.
- Fragility – Unlike unstructured graphs, structured KGs demand consistency: small errors in entity classification or relation typing can break the integrity of the graph.
- Scalability Issues – Expanding a structured KG while maintaining quality is difficult, leading to either stagnation or deterioration in accuracy.
Because of these challenges, structured nodes remain more of an ideal than a reality in large-scale, automated KG construction. While they are essential for high-quality benchmarks and specialized domains, their real-world application is often limited to areas where human supervision is feasible.
Semi-Structured Nodes
Semi-structured nodes are an emerging concept in recent KG research. While text-attributed nodes have existed for a long time, recent efforts focus on optimizing the balance of structured and textual content, especially in response to the rise of LLMs. Instead of rigidly defined entities or purely unstructured text, these nodes contain both attributes and free text, allowing for greater adaptability while maintaining some level of structure.
With the right alignment of advancements in KGs, LLMs, fine-tuning, and other innovations, semi-structured nodes could become the most feasible approach for automatic KG construction. They provide:
- More flexibility than structured nodes, making them easier to generate automatically.
- More semantic grounding than unstructured nodes, improving their usability for reasoning and retrieval.
- Better integration with LLMs, as models can process both structured attributes and supporting textual information.
Tuning the right amount of textual content is key: too little, and the nodes lose adaptability; too much, and they drift toward unstructured representations. As research progresses, semi-structured nodes may bridge the gap between traditional KGs and modern AI-driven knowledge systems, making large-scale, automatically constructed KGs a more realistic goal.
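One reason semi-structured nodes play well with LLMs is that they can be rendered straightforwardly into a prompt: the attributes give the model reliable anchors, while the free text supplies context. A small sketch (the serialization format is my own choice, not a standard):

```python
def node_to_prompt_block(name, attributes, description):
    """Serialize a semi-structured node into plain text for an LLM context window."""
    attr_lines = "\n".join(f"- {k}: {v}" for k, v in attributes.items())
    return f"Entity: {name}\nAttributes:\n{attr_lines}\nNotes: {description}"

block = node_to_prompt_block(
    name="Aspirin",
    attributes={"type": "Drug", "class": "NSAID"},
    description="Commonly used for pain relief and as an antiplatelet agent.",
)
print(block)
```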
Relations
In a KG, relations define meaning: they are what transform a collection of entities into a structured representation of knowledge. While various approaches exist for modeling relationships, I believe explicit and concise relations are the most effective. Implicit relationships, often justified through graph topology and embeddings, fail to provide clear semantic meaning, making them less convincing in real-world KG applications.
Another approach is weighted edges, where relations are assigned numerical values to indicate strength, confidence, or relevance. This can enhance knowledge representation, especially in uncertain or probabilistic contexts. However, just like with nodes, there is a balance to strike: too rigid, and the KG loses adaptability, struggling to scale or handle ambiguous knowledge; too flexible, and it risks becoming an ill-defined similarity graph rather than a structured KG.
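As a quick illustration (the confidence numbers are invented), weighted relations can be stored alongside the triple and later filtered by a threshold, so that downstream reasoning only sees edges the system is reasonably sure about:

```python
# (head, relation, tail, confidence) -- confidence values are illustrative only.
weighted_edges = [
    ("DrugA", "interacts_with", "DrugB", 0.92),
    ("DrugA", "interacts_with", "DrugC", 0.41),
]

def confident_edges(edges, threshold=0.8):
    """Keep only relations whose confidence clears the threshold."""
    return [e for e in edges if e[3] >= threshold]

print(confident_edges(weighted_edges))
# [('DrugA', 'interacts_with', 'DrugB', 0.92)]
```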
Finding the right level of granularity in relations is crucial. Overly complex schemas can make reasoning brittle, while oversimplified relations can lose the depth necessary for meaningful inference. Striking this balance is what determines whether a KG remains usable, interpretable, and scalable in practice.
Evolution of Automatic KG Construction
Automatic knowledge graph construction has evolved through three major stages, each with distinct advantages and challenges:
- Heuristic-Based (Barebones) – Early methods relied on handcrafted rules, pattern matching, and symbolic reasoning. While highly precise within well-defined domains, they were brittle, labor-intensive, and non-scalable, making them impractical for large, dynamic datasets.
- NLP-Based (Traditional) – The introduction of syntactic parsing, named entity recognition (NER), and relation extraction allowed for more scalable KG construction. However, these methods required heavy feature engineering and often struggled with ambiguity and domain adaptation.
- LLM-Based (Recent) – Modern approaches leverage large language models to extract and structure knowledge with minimal predefined rules. While this enables greater adaptability, it also introduces challenges like hallucinations, lack of interpretability, and difficulties in ensuring structured consistency.
This progression reflects a shift from precise but rigid systems to adaptive but uncertain models. While LLMs offer unprecedented automation, they also blur the line between structured knowledge and probabilistic inference, raising fundamental questions about KG reliability in real-world applications.
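To make the LLM-based stage concrete, here is a minimal sketch of the common prompt-and-parse pattern. `call_llm` is a stand-in for whatever model endpoint you use, and the prompt wording and expected JSON shape are assumptions for illustration, not a fixed standard.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM call (hosted API or local model)."""
    raise NotImplementedError

def extract_triples(text: str):
    prompt = (
        "Extract knowledge triples from the text below. "
        'Respond only with JSON: [{"head": "...", "relation": "...", "tail": "..."}]\n\n'
        f"Text: {text}"
    )
    raw = call_llm(prompt)
    try:
        triples = json.loads(raw)
    except json.JSONDecodeError:
        # LLMs do not reliably emit valid JSON; real pipelines need retries
        # and validation here, which is exactly where the hallucination and
        # consistency problems mentioned above surface.
        return []
    return [(t["head"], t["relation"], t["tail"]) for t in triples]
```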
Structured Knowledge in LLMs: The Current Landscape
Without specialized or fine-tuned models, the integration of structured knowledge into LLMs is largely constrained to two main approaches:
- Knowledge Graphs (KGs) – Structured knowledge is incorporated by retrieving triples or embeddings from a KG, either explicitly (e.g., querying specific facts) or implicitly (e.g., training on KG-derived corpora). However, without fine-tuning, LLMs struggle to effectively utilize these structures beyond surface-level retrieval.
- Context Injection (RAG) – A more flexible but unstructured approach, where relevant documents or structured snippets are retrieved and fed into the model’s context. While effective for augmenting responses with factual information, this method lacks the deep relational structure of true KG-based reasoning.
These constraints define the current limitations of structured knowledge in LLMs. Without targeted adaptation, models remain detached from deterministic structures, making precise, rule-based reasoning difficult to achieve, unless one reverts to GOFAI (Good Old-Fashioned AI).
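A sketch of the second approach: retrieved triples (or text snippets) are simply flattened into the prompt. The formatting choices below are mine; nothing here is a standard interface.

```python
def triples_to_context(triples):
    """Flatten retrieved KG triples into plain sentences for the prompt."""
    return "\n".join(f"{h} {r.replace('_', ' ')} {t}." for h, r, t in triples)

retrieved = [
    ("Aspirin", "treats", "Headache"),
    ("Aspirin", "interacts_with", "Warfarin"),
]

question = "Can aspirin be taken with warfarin?"
prompt = (
    "Use the following facts to answer the question.\n"
    f"{triples_to_context(retrieved)}\n\n"
    f"Question: {question}"
)
print(prompt)
# The model sees the facts only as flat text; the graph structure itself is lost,
# which is the limitation described above.
```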
Beyond the aforementioned directions, I’m increasingly convinced that any other approach is essentially a solved issue: there’s little left to meaningfully explore in this field. At this point, we either go all-in on extremely case-specific deep learning or machine learning, bypassing LLMs entirely, or we acknowledge that most systems remain toys until an insane amount of investment and engineering is poured in to make them viable.