Semantic Internet Data Integration

Members: Dimitrios GIAKATOS, Romain FONTUGNE

Category: Exploring

Tags: internet infrastructure, internet measurement, data integration, ai, llms

  1. Background: The Internet has evolved into a complex ecosystem where critical data regarding its structure and operation is often disconnected or inaccessible. The rapid expansion of network services has outpaced our ability to analyze them, as administrative and technical datasets remain poorly linked. This lack of integration creates a transparency gap, making it difficult to understand the relationship between the Internet’s physical infrastructure and the entities that control it.
  2. Purpose: Develop methodologies and automated frameworks that bridge the gap between technical network infrastructure and administrative metadata. By combining and linking different datasets, we aim to eliminate the “transparency gap”, enabling researchers, policymakers, and regulators to gain a multi-dimensional and actionable understanding of Internet ownership and operational behavior.
  3. Approach: We combine Large Language Models (LLMs) with advanced data integration techniques to synthesize heterogeneous data sources. Our methodologies focus on:
    1. Semantic Standardization: Using LLMs to translate and extract insights from unstructured data.
    2. Cross-Domain Linking: Developing automated systems to map physical infrastructure to entities like organizational identities and business types.
    3. Benchmarking & Scalability: Creating open datasets and evaluation frameworks to ensure the reproducibility and accuracy of Internet-scale analysis.

Internet Data Analysis in English


Internet data analysis is important for policymakers and regulators to make informed decisions. Although the research community makes many datasets available, analyzing them is challenging because they come in different formats and each requires its own analysis tools. The Internet Yellow Pages (IYP) database is designed to simplify Internet data analysis by combining many datasets into a single format. Hence, users can retrieve various Internet data using a single query language called Cypher. However, learning Cypher and understanding IYP’s schema is still challenging. CodeLlama, a family of Large Language Models (LLMs) trained for code generation, offers the potential to generate Cypher queries from English text. Yet, no benchmarks exist to evaluate the effectiveness of LLMs for the Cypher language, and existing LLMs, to our knowledge, misinterpret the IYP schema.

Introduction to LLMs

We publish a three-part tutorial series exploring the rising popularity and diverse applications of LLMs. This series is designed for beginners without prior computer science or programming knowledge and is available on the IIJ Engineers Blog.

Resources:

Exploring LLM architecture for generating Cypher queries

We explore an LLM architecture designed to improve Cypher query generation for IYP. It includes a schema-based linter that detects and reports schema-related errors in the generated Cypher queries.

Work published at ICEA’24.
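
To make the idea of a schema-based linter concrete, here is a minimal sketch in Python. It is not the published implementation. The label and relationship sets are a small hypothetical subset chosen for illustration, and the function name lint_cypher and the regex-based extraction are assumptions of this sketch. A production linter would load the full IYP schema and use a real Cypher parser.

```python
import re

# Hypothetical, heavily reduced schema for illustration only.
# The real IYP schema is larger and should be loaded from IYP itself.
NODE_LABELS = {"AS", "Prefix", "Country", "Name", "Organization"}
RELATIONSHIP_TYPES = {"ORIGINATE", "COUNTRY", "NAME", "MANAGED_BY"}

# Matches the first label in node patterns such as (a:AS) or (:Prefix).
NODE_PATTERN = re.compile(r"\(\s*\w*\s*:\s*(\w+)")
# Matches the first type in relationship patterns such as [r:ORIGINATE] or [:NAME].
REL_PATTERN = re.compile(r"\[\s*\w*\s*:\s*(\w+)")


def lint_cypher(query: str) -> list[str]:
    """Return one message per label or relationship type missing from the schema."""
    errors = []
    for label in NODE_PATTERN.findall(query):
        if label not in NODE_LABELS:
            errors.append(f"unknown node label: {label}")
    for rel_type in REL_PATTERN.findall(query):
        if rel_type not in RELATIONSHIP_TYPES:
            errors.append(f"unknown relationship type: {rel_type}")
    return errors


if __name__ == "__main__":
    # The relationship type is misspelled on purpose.
    generated = "MATCH (a:AS)-[:ORIGINATES]->(p:Prefix) RETURN p.prefix"
    for message in lint_cypher(generated):
        print(message)
```

The messages returned by such a linter could be fed back to the LLM as a hint for regenerating the query. This simplified version only inspects the first label or type in each pattern and ignores multi-label nodes and alternatives written with `|`.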

Pythia: Facilitating Access to Internet Data Using LLMs and IYP

We assess the ability of different LLMs to generate Cypher queries. We introduce CypherEval, a dataset for evaluating LLM-generated Cypher queries, and we propose a methodology to benchmark LLMs for IYP. Finally, we present Pythia, a system for the generation of IYP Cypher queries.

Dataset published at Codeberg.

Work published at IEEE LCN’25.
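
One common way to benchmark generated queries is to execute both the generated query and a reference query and compare their results. The sketch below illustrates that idea only. It is not necessarily the metric used by CypherEval or Pythia, and the names same_results, accuracy, and run_query, as well as the case format, are hypothetical.

```python
from collections import Counter


def same_results(expected_rows, generated_rows) -> bool:
    """Compare two result sets as multisets of rows, ignoring row order."""
    return Counter(map(tuple, expected_rows)) == Counter(map(tuple, generated_rows))


def accuracy(cases, run_query) -> float:
    """Fraction of cases whose generated query returns the reference result.

    cases: list of dicts with "reference" and "generated" query strings.
    run_query: callable that executes a query and returns a list of rows.
    """
    correct = 0
    for case in cases:
        try:
            generated_rows = run_query(case["generated"])
        except Exception:
            continue  # a query that fails to execute counts as incorrect
        if same_results(run_query(case["reference"]), generated_rows):
            correct += 1
    return correct / len(cases) if cases else 0.0
```

In practice, run_query would send the query to an IYP instance. Here it is left as a parameter so the comparison logic can be tested without a database.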

Pear: The open-source peering tool


Pear is an open-source peering tool that combines traffic and BGP data to provide in-depth analysis of Internet traffic exchange. It helps network operators discover, connect, and exchange Internet traffic with other networks, with the goal of improving the efficiency, scalability, and performance of the Internet. By using Pear, networks can reduce their dependence on transit providers, lower operational costs, and improve the overall quality of their Internet services.

Tool published at Codeberg.
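
As an illustration of what combining traffic and BGP data can mean, the sketch below attributes traffic volumes to origin ASes by longest-prefix match and ranks the ASes by volume, which is one plausible way to surface peering candidates. It is not Pear's actual implementation. The function names, input formats, and sample values (documentation prefixes and documentation ASNs) are assumptions of this sketch.

```python
import ipaddress
from collections import defaultdict


def origin_as(ip, routes):
    """Longest-prefix match of ip against (network, origin ASN) pairs."""
    addr = ipaddress.ip_address(ip)
    best = None
    for network, asn in routes:
        if addr.version == network.version and addr in network:
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, asn)
    return best[1] if best else None


def traffic_per_as(flows, routes):
    """Sum bytes per origin AS and return (ASN, bytes) pairs, largest first."""
    totals = defaultdict(int)
    for dst_ip, byte_count in flows:
        asn = origin_as(dst_ip, routes)
        if asn is not None:  # skip destinations with no matching route
            totals[asn] += byte_count
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up BGP table: (prefix, origin ASN).
    routes = [
        (ipaddress.ip_network("198.51.100.0/24"), 64496),
        (ipaddress.ip_network("198.51.100.128/25"), 64497),
        (ipaddress.ip_network("203.0.113.0/24"), 64498),
    ]
    # Made-up flow records: (destination IP, bytes).
    flows = [
        ("198.51.100.10", 500),
        ("198.51.100.200", 1200),
        ("203.0.113.5", 300),
    ]
    for asn, volume in traffic_per_as(flows, routes):
        print(f"AS{asn}: {volume} bytes")
```

The linear scan over routes keeps the sketch short. A full BGP table would call for a radix tree or similar structure for prefix lookups.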

Network Infrastructure Classification


Networks are the backbone of global communication and data exchange. Despite their critical role, the ownership and operational structures governing these networks remain poorly understood. A significant challenge lies in the inconsistent maintenance of organizational data, which often leads to fragmented mapping of both Autonomous Systems (ASes) and IP prefixes. While existing research has attempted to refine network organization datasets, significant gaps remain. This research aims to enhance the accuracy and completeness of network organization information. By establishing a more robust mapping framework, this work seeks to improve transparency and enable more effective management of global Internet infrastructure.
