🌐 Web Content Info Extractor to JSON Demo Overview


🎯 Purpose of the Sample

The Web Content Info Extractor to JSON Demo shows how to use the LM-Kit.NET SDK to extract and summarize information from web pages and output the results in a predefined JSON format. The sample uses a language model to process HTML content, identify key details, and structure the extracted information in a standardized format. It also shows how the LMKit.TextGeneration.Sampling.Grammar class can constrain the language model's output so that it consistently follows the expected JSON structure.


👥 Industry Target Audience

This sample is particularly useful for developers and organizations in the following sectors:

  • 📰 Content Aggregation: Businesses that need to aggregate and summarize web content.
  • 📊 Data Analysis: Analysts who require structured data from web pages for analysis.
  • 🛠️ Web Scraping: Developers creating web scraping tools to extract specific information from websites.
  • 📈 SEO and Marketing: Professionals needing to analyze and categorize web content for marketing and SEO purposes.

🚀 Problem Solved

Web content often contains valuable information that is unstructured and difficult to process programmatically. This demo addresses the problem by using a language model to automatically extract and summarize key information from web pages, structuring it in a JSON format for easy analysis and integration with other systems.


💻 Sample Application Description

The Web Content Info Extractor to JSON Demo is a console application that prompts the user for a web page URL, downloads and extracts the page's text content, and summarizes it into a JSON structure. The summarization is performed by a pre-trained language model, so the extracted fields reflect the context of the page rather than fixed keyword rules.

✨ Key Features

  • 📈 Model Selection: Users can choose from multiple pre-trained models.
  • 📄 Content Extraction: Downloads and extracts text content from the provided web page URL.
  • 📝 Information Summarization: Summarizes extracted content into a JSON structure with fields for primary topic, domain or field, language, and audience.
  • 📐 Grammar-based Parsing: Uses the LMKit.TextGeneration.Sampling.Grammar class to constrain the model's output so that it follows the expected JSON structure.
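
The Content Extraction feature above can be approximated with standard .NET types only. The following is a minimal sketch, not the sample's actual implementation: the names PageText, FromHtml and DownloadAsync are hypothetical, and the regex-based tag stripping is a simplification chosen for brevity (a real HTML parser would be more robust). It does not use any LM-Kit.NET API.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical helper: turns raw HTML into plain text suitable for a prompt.
static class PageText
{
    public static string FromHtml(string html)
    {
        // Drop <script> and <style> blocks together with their contents.
        string noScripts = Regex.Replace(
            html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Replace every remaining tag with a space.
        string noTags = Regex.Replace(noScripts, @"<[^>]+>", " ");

        // Decode entities such as &amp; and collapse runs of whitespace.
        string decoded = WebUtility.HtmlDecode(noTags);
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    // Downloads a page and returns its text content (requires network access).
    public static async Task<string> DownloadAsync(Uri uri)
    {
        using (var client = new HttpClient())
        {
            string html = await client.GetStringAsync(uri);
            return FromHtml(html);
        }
    }
}

static class Program
{
    static void Main()
    {
        // Inline HTML keeps the demo independent of the network.
        string html = "<html><head><style>p{color:red}</style></head>" +
                      "<body><p>Hello &amp; welcome</p><script>var x = 1;</script></body></html>";
        Console.WriteLine(PageText.FromHtml(html)); // prints: Hello & welcome
    }
}
```

In an application, the text returned by DownloadAsync would be the input handed to the language model for summarization.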

🧠 Supported Models

The sample supports the following pre-trained models for content extraction and summarization:

  • Mistral Nemo 2407 12.2B
  • Meta Llama 3.1 8B
  • Google Gemma2 9B Medium
  • Microsoft Phi-3 3.82B Mini
  • Alibaba Qwen-2 7.6B

🛠️ Getting Started

📋 Prerequisites

  • .NET Framework 4.6.2 or .NET 6.0

📥 Download the Project

▶️ Running the Application

  1. 📂 Clone the repository:

    git clone https://github.com/LM-Kit/lm-kit-net-samples.git
    
  2. 📁 Navigate to the project directory:

    cd lm-kit-net-samples/console_framework_4.62/web_content_info_extractor_to_json
    

    or

    cd lm-kit-net-samples/console_net6/web_content_info_extractor_to_json
    
  3. 🔨 Build and run the application:

    dotnet build
    dotnet run
    

💡 Example Usage

  1. Set the License Key (if available):

    LMKit.Licensing.LicenseManager.SetLicenseKey(""); // Set an optional license key here if available.
    
  2. Select the Model: The application will prompt you to select one of the predefined models or enter a custom model URI.

  3. Enter the Web Page URL:

    Enter webpage page URI to be analyzed: https://example.com
    
  4. View the Extracted Information: The application will download and process the content from the provided URL, and output the summarized information in JSON format.
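
Once the JSON has been produced, a consuming application will typically want to check it and read the four fields. The sketch below shows one way to do that with System.Text.Json, which ships with .NET 6 and is available as a NuGet package for .NET Framework. The key names ("Primary Topic", "Domain or Field", "Language", "Audience") are assumptions derived from the field list in Key Features, and may not match the sample's exact output. The JSON string in Main is a hand-written placeholder, not real model output. SummaryReader and TryRead are hypothetical names.

```csharp
using System;
using System.Text.Json;

// Hypothetical reader for the summary JSON; key names are assumed, not confirmed.
static class SummaryReader
{
    public static readonly string[] ExpectedKeys =
        { "Primary Topic", "Domain or Field", "Language", "Audience" };

    // Returns true only if json is an object holding every expected key as a string.
    public static bool TryRead(string json, out string[] values)
    {
        values = new string[ExpectedKeys.Length];
        try
        {
            using (JsonDocument doc = JsonDocument.Parse(json))
            {
                if (doc.RootElement.ValueKind != JsonValueKind.Object)
                    return false;

                for (int i = 0; i < ExpectedKeys.Length; i++)
                {
                    if (!doc.RootElement.TryGetProperty(ExpectedKeys[i], out JsonElement field)
                        || field.ValueKind != JsonValueKind.String)
                        return false;
                    values[i] = field.GetString();
                }
                return true;
            }
        }
        catch (JsonException)
        {
            // Malformed JSON is reported as a failed read rather than an exception.
            return false;
        }
    }
}

static class Program
{
    static void Main()
    {
        // Placeholder input written by hand for illustration only.
        string json = "{\"Primary Topic\":\"Example Domain\",\"Domain or Field\":\"Web\"," +
                      "\"Language\":\"English\",\"Audience\":\"General\"}";

        if (SummaryReader.TryRead(json, out string[] values))
        {
            for (int i = 0; i < values.Length; i++)
                Console.WriteLine(SummaryReader.ExpectedKeys[i] + ": " + values[i]);
        }
        else
        {
            Console.WriteLine("Output did not match the expected structure.");
        }
    }
}
```

Validating the structure on the consumer side is a defensive choice: even with grammar-constrained generation, a check like this catches truncated output or a mismatch between the assumed and actual key names.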

By following these steps, developers can explore LM-Kit.NET and integrate web content extraction and summarization into their own applications, with the model's output constrained to a structured JSON format.