HeXtractor: Automated Heterogeneous Graph Extraction from Multimodal Data

Introduction

Transforming diverse data sources into graph representations suitable for deep learning remains a challenge in modern machine learning. HeXtractor provides a standardized, automated framework for converting structured and unstructured data into heterogeneous graphs compatible with Graph Neural Networks (GNNs).

[Figures: HeXtractor architecture; HeXtractor in action]

Publication and Citation

Publication: Journal of Open Source Software, Volume 10, Issue 110, Article 8057, 2025
Authors: Filip Wójcik, Marcin Malczewski
Repository: Available on GitHub
Documentation: HeXtractor Documentation

Citation: Wójcik, F., & Malczewski, M. (2025). HeXtractor: Extracting Heterogeneous Graphs from Structured and Textual Data for Graph Neural Networks. Journal of Open Source Software, 10(110), 8057. https://doi.org/10.21105/joss.08057

Software Overview

HeXtractor is an open-source Python library that streamlines heterogeneous graph construction. Originally developed as part of the HexGIN project for financial transaction analysis, it has evolved into a domain-agnostic framework serving researchers and practitioners across multiple fields.

Technical Architecture

Core Capabilities

HeXtractor provides four core capabilities for graph construction:

  1. Declarative Schema Definition: Intuitive interface for specifying node types, edge relationships, and associated metadata
  2. Multi-Modal Data Processing: Unified handling of structured tabular data and unstructured text
  3. LLM Integration: Seamless connection with Large Language Models via LangChain for semantic graph extraction
  4. PyTorch Geometric Compatibility: Direct export to PyG’s HeteroData format for immediate use in GNN training

Implementation Features

The library also provides the following implementation features:

  • Schema Validation: Automatic consistency checking and error detection
  • Interactive Visualization: PyVis-based graph visualization with customizable layouts
  • Flexible Data Ingestion: Support for single-table and multi-table relational data
  • Scalable Architecture: Designed to handle datasets of varying complexity
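
To make the idea of schema consistency checking concrete, here is a minimal, library-independent sketch. The names GraphSchema, EdgeType, and validate are hypothetical and introduced only for this example; they are not HeXtractor's API. The sketch assumes only that a schema declares node types and typed edges, and that every edge must point at declared node types.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EdgeType:
    """A typed relation: (source node type, relation name, target node type)."""
    source: str
    relation: str
    target: str


@dataclass
class GraphSchema:
    """Hypothetical declarative schema: node types with attribute columns, plus edge types."""
    node_types: dict = field(default_factory=dict)   # node type -> list of attribute columns
    edge_types: list = field(default_factory=list)   # list of EdgeType


def validate(schema):
    """Return a list of readable problems; an empty list means the schema is consistent."""
    problems = []
    seen = set()
    for edge in schema.edge_types:
        # Every endpoint of an edge must be a declared node type.
        for endpoint in (edge.source, edge.target):
            if endpoint not in schema.node_types:
                problems.append(
                    f"edge {edge.relation!r} references undeclared node type {endpoint!r}"
                )
        # The same (source, relation, target) triple should be declared only once.
        if edge in seen:
            problems.append(f"duplicate edge type {edge}")
        seen.add(edge)
    return problems


schema = GraphSchema(
    node_types={"company": ["revenue", "size"], "employee": ["age", "salary"]},
    edge_types=[
        EdgeType("company", "has", "employee"),
        EdgeType("company", "owns", "asset"),  # "asset" is deliberately undeclared
    ],
)
errors = validate(schema)
print(errors)
```

Returning a list of problems rather than raising on the first one lets a user fix all schema errors in one pass; a real validator would likely check more, such as attribute columns that are missing from the data.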

Methodological Contributions

Structured Data Processing

HeXtractor supports tabular data transformation through:

Single-Table Mode: Each row encodes relationships among the entities defined by its columns, and HeXtractor generates the node and edge definitions automatically.
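
The single-table idea can be sketched in plain Python. This is an illustration under stated assumptions, not HeXtractor's implementation: the function table_to_graph, the column names, and the sample rows are all made up for the example. Each distinct value in a column becomes a node, and each row becomes one edge.

```python
def table_to_graph(rows, source_col, target_col):
    """Turn the rows of one table into node index maps and an edge list.

    Each distinct value in `source_col` / `target_col` becomes one node;
    each row becomes one edge between the two nodes it mentions.
    """
    source_ids, target_ids = {}, {}
    edge_index = [[], []]  # COO layout: row 0 = source indices, row 1 = target indices
    for row in rows:
        # setdefault assigns the next free index the first time a value is seen.
        s = source_ids.setdefault(row[source_col], len(source_ids))
        t = target_ids.setdefault(row[target_col], len(target_ids))
        edge_index[0].append(s)
        edge_index[1].append(t)
    return source_ids, target_ids, edge_index


rows = [
    {"company": "Acme", "employee": "Ann"},
    {"company": "Acme", "employee": "Bob"},
    {"company": "Globex", "employee": "Cy"},
]
companies, employees, edge_index = table_to_graph(rows, "company", "employee")
print(companies)   # {'Acme': 0, 'Globex': 1}
print(edge_index)  # [[0, 0, 1], [0, 1, 2]]
```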

Multi-Table Mode: The GraphSpecs framework merges entity and relationship tables into a unified heterogeneous graph while maintaining referential integrity.
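
As a rough illustration of what maintaining referential integrity means when merging tables, the sketch below resolves foreign keys in a relationship table against two entity tables and fails on any dangling reference. The helpers index_entities and link_tables, and all table and column names, are hypothetical; this does not reproduce the GraphSpecs interface.

```python
def index_entities(table, key):
    """Map each entity's primary key to a contiguous node index."""
    return {row[key]: i for i, row in enumerate(table)}


def link_tables(relations, src_index, dst_index, src_key, dst_key):
    """Resolve foreign keys to node indices, failing loudly on dangling references."""
    edges = []
    for row in relations:
        src, dst = row[src_key], row[dst_key]
        if src not in src_index or dst not in dst_index:
            raise KeyError(f"dangling reference in relation row: {row}")
        edges.append((src_index[src], dst_index[dst]))
    return edges


companies = [{"company_id": "c1", "revenue": 10.0}, {"company_id": "c2", "revenue": 4.5}]
employees = [
    {"employee_id": "e1", "age": 31},
    {"employee_id": "e2", "age": 45},
    {"employee_id": "e3", "age": 28},
]
employment = [
    {"company_id": "c1", "employee_id": "e1"},
    {"company_id": "c1", "employee_id": "e3"},
    {"company_id": "c2", "employee_id": "e2"},
]

company_index = index_entities(companies, "company_id")
employee_index = index_entities(employees, "employee_id")
edges = link_tables(employment, company_index, employee_index, "company_id", "employee_id")
print(edges)  # [(0, 0), (0, 2), (1, 1)]
```

Raising on a dangling key is one possible policy; dropping or logging such rows would be an equally reasonable choice depending on the data.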

Text-Based Graph Extraction

Integration with Large Language Models enables semantic graph construction:

  1. Natural language input processing through LLM APIs
  2. GraphDocument generation containing entities and relationships
  3. Automatic conversion to HeteroData objects
  4. Preservation of semantic relationships in graph structure
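
The conversion step (items 2 to 4 above) can be sketched without calling any LLM. The Node and Relationship classes below are simplified stand-ins loosely modeled on the node and relationship records of a graph document; they are assumptions made for this example, as are the function to_hetero_dict and the hard-coded "extracted" entities. The output is a plain-dictionary analogue of a heterogeneous graph, not an actual HeteroData object.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    id: str
    type: str


@dataclass(frozen=True)
class Relationship:
    source: Node
    target: Node
    type: str


def to_hetero_dict(nodes, relationships):
    """Group nodes by type and build one COO edge list per (src_type, relation, dst_type).

    Assumes every relationship endpoint also appears in `nodes`.
    """
    node_index = defaultdict(dict)  # node type -> {node id -> position within that type}
    for node in nodes:
        per_type = node_index[node.type]
        per_type.setdefault(node.id, len(per_type))
    edge_index = defaultdict(lambda: [[], []])
    for rel in relationships:
        key = (rel.source.type, rel.type, rel.target.type)
        edge_index[key][0].append(node_index[rel.source.type][rel.source.id])
        edge_index[key][1].append(node_index[rel.target.type][rel.target.id])
    return dict(node_index), dict(edge_index)


# Stand-in for what an LLM might extract from: "Ann and Bob work at Acme. Ann knows Bob."
ann, bob, acme = Node("Ann", "Person"), Node("Bob", "Person"), Node("Acme", "Company")
rels = [
    Relationship(ann, acme, "WORKS_AT"),
    Relationship(bob, acme, "WORKS_AT"),
    Relationship(ann, bob, "KNOWS"),
]
node_maps, edge_maps = to_hetero_dict([ann, acme, bob], rels)
print(node_maps)
print(edge_maps)
```

Keying edges by the full (source type, relation, target type) triple is what keeps the relation semantics distinct in the resulting graph: WORKS_AT and KNOWS edges end up in separate edge lists.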

Research Applications

Domain-Agnostic Design

HeXtractor’s flexibility enables applications across multiple research areas:

  • Financial Analysis: Money laundering detection and fraud identification in transaction networks
  • Recommendation Systems: Building user-item interaction graphs with rich feature representations
  • Biomedical Research: Knowledge graph construction for drug discovery and protein interaction networks
  • Social Network Analysis: Modeling complex social relationships and information diffusion

Case Studies

The software has been applied in the following settings:

  • Banking fraud detection systems processing millions of transactions
  • E-commerce recommendation engines serving personalized suggestions
  • Academic knowledge graph construction from scientific literature

Impact on Graph Machine Learning

Addressing Challenges

HeXtractor addresses common problems in graph-based machine learning:

  1. Standardization Gap: Provides a consistent methodology for graph construction across different domains
  2. Reproducibility: Supports experimental reproducibility through declarative specifications
  3. Accessibility: Lowers the barrier to entry for researchers new to graph neural networks
  4. Integration Complexity: Simplifies the pipeline from raw data to trainable graph models

Community Adoption

The open-source nature of HeXtractor has fostered adoption:

  • Active community contributions and feature requests
  • Integration into existing ML pipelines
  • Use in educational settings for teaching graph neural networks

Technical Examples

Structured Data Transformation

Printed summary of an example HeteroData object with company and employee nodes:

HeteroData(
  company={ x=[3, 2] },
  employee={ x=[7, 2], y=[7] },
  (company, has, employee)={ edge_index=[2, 6] }
)
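
In this summary, x=[3, 2] means 3 company nodes with 2 features each, y=[7] means one label per employee, and edge_index=[2, 6] means 6 edges stored as two rows of source and target indices. The sketch below reproduces those shapes with plain nested lists. All values are placeholders chosen for illustration, and the shape helper is a hypothetical utility, not part of any library.

```python
def shape(nested):
    """Return the shape of a rectangular nested list, e.g. [[0, 0], [0, 0]] -> [2, 2]."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None
    return dims


# Placeholder values only; the shapes are what matter here.
company_x = [[0.0, 0.0] for _ in range(3)]   # 3 companies, 2 features each
employee_x = [[0.0, 0.0] for _ in range(7)]  # 7 employees, 2 features each
employee_y = [0] * 7                         # one label per employee
has_edge_index = [
    [0, 0, 1, 1, 2, 2],  # source (company) indices
    [0, 1, 2, 3, 4, 5],  # target (employee) indices
]

print(shape(company_x), shape(employee_x), shape(employee_y), shape(has_edge_index))
# [3, 2] [7, 2] [7] [2, 6]
```

In PyTorch Geometric itself these would be tensors assigned to a HeteroData object, for example data['company'].x and data['company', 'has', 'employee'].edge_index. Note that in this placeholder data the employee at index 6 has no edge, which is one way 7 nodes and 6 edges can coexist.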

Text-to-Graph Conversion

Natural language descriptions are automatically transformed into structured graph representations, with entity recognition and relationship extraction performed by large language models.

Future Development

Ongoing development focuses on:

  • Enhanced support for temporal graph construction
  • Integration with additional graph learning frameworks
  • Advanced visualization capabilities for large-scale graphs
  • Automated hyperparameter optimization for graph construction

Conclusion

HeXtractor makes heterogeneous graph construction more accessible within the graph machine learning ecosystem. By bridging the gap between diverse data sources and graph neural networks, it lets researchers focus on model development rather than on data preprocessing.