Members: Dimitrios GIAKATOS, Romain FONTUGNE
Category: Exploring
Tags: explainability, classification, business intelligence, traffic analysis, network monitoring
- Background: The Internet is a large, complex network, and its rapid evolution has led to a growing need for in-depth analysis. In addition, the increasing volume of data makes the Internet grow even faster, which makes it challenging to analyze, as most Internet-related data are not linked or are difficult to access.
- Purpose: Develop methods that facilitate Internet data analysis and link the Internet’s infrastructure with related but unconnected data, resulting in a better understanding of the Internet.
- Approach: Create datasets representing different parts of the Internet by combining data from different sources, then analyze them further.
Internet Data Analysis in English
Internet data analysis helps policymakers and regulators make informed decisions. Although the research community has made various datasets available, analyzing them is challenging because each comes with its own format and requires its own analysis tools. The Internet Yellow Pages (IYP) database is designed to simplify Internet data analysis by combining many datasets into a single format. Hence, users can retrieve various Internet data using a single query language, Cypher. However, learning Cypher and understanding IYP’s schema is still challenging. CodeLlama, a family of Large Language Models (LLMs) trained for code generation, offers the potential to generate Cypher queries from English text. Yet no benchmarks exist to evaluate the effectiveness of LLMs for the Cypher language, and existing LLMs, to our knowledge, misinterpret the IYP schema.
Introduction to LLMs
We publish a three-part tutorial series exploring the rising popularity and diverse applications of LLMs. This series is designed for beginners without prior computer science or programming knowledge and is available on the IIJ Engineers Blog.
Resources:
- IIJ Engineers Blog article part A
- IIJ Engineers Blog article part B
- IIJ Engineers Blog article part C
Unlocking CodeLlama for Internet Yellow Pages
We introduce CypherEval, a dataset for evaluating LLM-generated Cypher queries, and propose a methodology to benchmark CodeLlama models for IYP. We found that the quality of CodeLlama’s output varies greatly with the prompt’s difficulty. By classifying the errors made by CodeLlama, we build a taxonomy of errors in LLM-generated Cypher queries, and we show that most errors come from misinterpreting IYP’s schema (76.5%) and from returning wrong fields (74.1%). These categories are not mutually exclusive: a single query can contain both kinds of error.
Dataset published at Codeberg.
Exploring LLM architecture for generating Cypher queries
We explore an LLM architecture designed to enhance Cypher query generation for IYP. It includes a schema-based linter that detects and reports schema-related errors in the generated Cypher queries.
Work published at ICEA’24.
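To give a rough idea of what a schema-based linter can do, here is a minimal sketch in Python. It is not the linter from the published work, whose design is not described on this page. The toy schema (`TOY_SCHEMA`), the function name `lint_cypher`, and the regex-based parsing are all assumptions made for illustration, and the labels and relationship types are placeholders rather than the real IYP schema.

```python
import re

# Hypothetical toy schema for illustration only; NOT the real IYP schema.
TOY_SCHEMA = {
    "labels": {"AS", "Prefix", "Country"},
    "relationships": {
        # relationship type -> allowed (start label, end label) pairs
        "ORIGINATE": {("AS", "Prefix")},
        "COUNTRY": {("AS", "Country"), ("Prefix", "Country")},
    },
}

# Naive regexes: good enough for simple MATCH patterns, not a full Cypher parser.
NODE_LABEL = re.compile(r"\(\s*\w*\s*:\s*(\w+)")
REL_TYPE = re.compile(r"\[\s*\w*\s*:\s*(\w+)")
DIRECTED_PATTERN = re.compile(
    r"\(\s*\w*\s*:\s*(\w+)[^)]*\)\s*-\[\s*\w*\s*:\s*(\w+)[^\]]*\]->\s*\(\s*\w*\s*:\s*(\w+)"
)


def lint_cypher(query, schema=TOY_SCHEMA):
    """Return a list of schema-related error messages for a Cypher query."""
    errors = []
    # 1. Every node label must exist in the schema.
    for label in NODE_LABEL.findall(query):
        if label not in schema["labels"]:
            errors.append(f"unknown node label: {label}")
    # 2. Every relationship type must exist in the schema.
    for rel in REL_TYPE.findall(query):
        if rel not in schema["relationships"]:
            errors.append(f"unknown relationship type: {rel}")
    # 3. Known relationships must connect the label pair the schema allows.
    for start, rel, end in DIRECTED_PATTERN.findall(query):
        allowed = schema["relationships"].get(rel)
        known_labels = start in schema["labels"] and end in schema["labels"]
        if allowed is not None and known_labels and (start, end) not in allowed:
            errors.append(f"relationship {rel} does not connect {start} to {end}")
    return errors


if __name__ == "__main__":
    generated = "MATCH (p:Prefix)-[:ORIGINATE]->(a:AS) RETURN a"
    for message in lint_cypher(generated):
        print(message)
```

In a generation pipeline, messages like these could be fed back to the LLM so that it can retry with a corrected query. A production linter would rely on a real Cypher parser rather than regular expressions, which can be confused by string literals or more complex patterns.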
Pear: The open-source peering tool
Pear is an open-source peering tool that combines traffic and BGP data to provide in-depth analysis of Internet traffic exchange. It helps network operators discover, connect, and exchange Internet traffic with other networks, improving the efficiency, scalability, and performance of the Internet. By using Pear, networks can reduce their dependence on transit providers, lower operational costs, and improve the overall quality of their Internet services.
Tool published at Codeberg.
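As an illustration of what combining traffic and BGP data can look like, the sketch below maps each traffic flow to the origin AS of its longest matching BGP prefix and ranks ASes by traffic volume. This is not Pear's implementation: the function name `traffic_per_origin_as`, the input formats, and the ranking criterion are assumptions chosen for this example, and the addresses and AS numbers come from the ranges reserved for documentation.

```python
import ipaddress
from collections import defaultdict


def traffic_per_origin_as(flows, bgp_routes):
    """Aggregate traffic volume per origin AS.

    flows: iterable of (destination IP, bytes) pairs (hypothetical format).
    bgp_routes: dict mapping prefix -> origin ASN (hypothetical format).
    Returns (ASN, bytes) pairs sorted by decreasing volume.
    """
    networks = [(ipaddress.ip_network(p), asn) for p, asn in bgp_routes.items()]
    # Most specific prefixes first, so the first hit is the longest match.
    networks.sort(key=lambda item: item[0].prefixlen, reverse=True)

    totals = defaultdict(int)
    for dst_ip, nbytes in flows:
        addr = ipaddress.ip_address(dst_ip)
        for net, asn in networks:
            if addr.version == net.version and addr in net:
                totals[asn] += nbytes
                break  # flows without any matching prefix are ignored
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    routes = {
        "192.0.2.0/24": 64500,
        "192.0.2.128/25": 64502,
        "198.51.100.0/24": 64501,
    }
    flows = [("192.0.2.10", 100), ("198.51.100.5", 300), ("192.0.2.200", 50)]
    for asn, nbytes in traffic_per_origin_as(flows, routes):
        print(f"AS{asn}: {nbytes} bytes")
```

ASes at the top of such a ranking are natural candidates for direct peering, since traffic exchanged with them would no longer need to go through a transit provider. The linear scan over prefixes keeps the sketch short; a full BGP table would call for a radix tree or a similar structure.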
ASes business type classification
Autonomous Systems (ASes) are important for global communication and data exchange. Despite their vital role, the organizations that own and operate these ASes remain poorly understood. A significant challenge is the inconsistent maintenance of AS organization information by Regional Internet Registries (RIRs). While existing research has made efforts to enhance AS organization information, there is still room for improvement, and further work is needed to provide a more comprehensive and accurate understanding of these entities. The goal of this research is to enhance the accuracy and completeness of AS organization information, ultimately leading to improved transparency and more effective management of the Internet’s infrastructure.