Pre-Test: federated_learning/fedavg Example for KubeEdge Ianvs (LFX 2025 T3) #252

@Spiral-Memory

Proposal for LFX Term-3 at KubeEdge

Comprehensive Example Restoration Proposal for KubeEdge Ianvs

Issue: Proposal to fix configuration and runtime failures in CIFAR100-Federated federated_learning/fedavg example

Parent Issue: #230

By: Zishan Ahmad

Mentors: Zimu Zheng, Shijing Hu

Background

The CIFAR100-Federated example is a compact framework on KubeEdge Ianvs, featuring client-side training and server-side FedAvg aggregation on CIFAR-100. It serves as a baseline for onboarding, benchmarking, and prototyping edge ML workflows. However, the example is currently broken and cannot be run due to multiple issues, each causing failures either during configuration or at runtime.
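For readers new to the aggregation step mentioned above, the following is a minimal sketch of standard FedAvg weighted averaging, assuming each client reports a flat list of parameters and its local sample count. The function name `fedavg` and the flat-list representation are illustrative assumptions, not the code used in the example.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameters, weighting each client by its sample count.

    Illustrative sketch only: parameters are flat lists of floats here,
    whereas a real model would hold one array per layer.
    """
    if len(client_weights) != len(client_sizes):
        raise ValueError("Need exactly one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("Total sample count must be positive")

    num_params = len(client_weights[0])
    averaged = [0.0] * num_params
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != num_params:
            raise ValueError("All clients must report the same number of parameters")
        share = size / total  # this client's fraction of all training samples
        for i, value in enumerate(weights):
            averaged[i] += share * value
    return averaged


# Two hypothetical clients holding 1 and 3 samples respectively.
global_weights = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)
```

Clients with more local data pull the global parameters further toward their own update, which is the defining property of FedAvg compared with a plain unweighted mean.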

Critical Issues

  1. Dependency Conflict
    TensorFlow is used throughout the code but is not listed in any requirements.txt file. Additionally, the Ianvs documentation suggests using Python 3.8, but running the example in common environments leads to TensorFlow version dependency errors. Even older versions, such as TensorFlow 2.10.0, trigger protobuf dependency issues.

  2. Non-Portable Configuration
    Hardcoded absolute paths result in "not found" or permission denied errors while running utils.py or the Ianvs examples. Even after resolving the path issues, the example still fails to run. Some debugging screenshots are attached:

    [Four debugging screenshots]
  3. Incorrect YAML Keys
    The YAML file incorrectly uses the keys train_url and test_url. The correct keys are train_index and test_index for passing this prepared dataset, as shown in core/testenvmanager/dataset/dataset.py.

  4. Runtime Code Bug
    Even after fixing the environment and paths, the prediction loop in basemodel.py triggers a fatal runtime error:

    AttributeError: 'list' object has no attribute 'x'
  5. Missing Documentation
    No README.md exists to guide setup, dependencies, or the correct workflow, leaving users guessing and encountering unnecessary friction.

Impact

These issues make the CIFAR100-Federated example effectively unusable in its current state. They hinder onboarding, slow prototyping, and create unnecessary friction for users attempting to benchmark edge ML workflows. Resolving them would restore the example as a reliable baseline for testing, experimentation, and learning in KubeEdge Ianvs environments.

Goals

This proposal aims to restore the federated_learning/fedavg example to a fully functional, portable, and documented state. Key deliverables:

  1. Introduce a centralized configuration pattern for the example using a config.py file to eliminate hardcoded paths and serve as a template for other examples.
  2. Change train_url and test_url to train_index and test_index, respectively.
  3. Fix a runtime bug in basemodel.py.
  4. Add a comprehensive README.md with setup instructions.
  5. Include a requirements.txt file to formalize dependencies.

Scope

  • Users: New community members, researchers, and developers exploring KubeEdge Ianvs, especially in the context of federated learning.
  • Scope: Limited to the examples/cifar100/ directory and its dependency changes. No core-related changes are proposed.
  • Uniqueness:
    • Focus Area: Focuses on the CIFAR-100 federated learning example, an area that has not received recent maintenance and is critical for new users interested in FL on KubeEdge.
    • Pipeline Approach: Identifies failure points across the full pipeline, including documentation, environment, configuration, and code logic, rather than addressing a single isolated issue.

Detailed Design

Architecture

The architecture will remain within the example itself; no core Ianvs modules will be modified. Proposed components:

  • Centralized Configuration Manager: Introduce examples/cifar100/config.py to manage configuration centrally.
  • YAML Placeholders: Add placeholders in YAML files that can be controlled via the config file to eliminate hardcoded values and improve maintainability. This ensures consistency and makes the configuration easier to verify via CI/CD.
  • Bug Fixes: Address potential issues across YAML and Python files.
  • Documentation and Dependencies: Provide a README.md and requirements.txt to formalize setup instructions and dependencies.

Module Details

1. config.py (Centralized Path Configuration)

  • Introduce a central configuration for input/output/model paths with both absolute and relative options.
  • Dynamically replace placeholders in .yaml files (e.g., {{TRAIN_INDEX_FILE}}) with actual paths to eliminate hardcoded values and improve maintainability.
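The placeholder substitution described above could look roughly like the sketch below. The names `EXAMPLE_ROOT`, `PATHS`, `render_template`, and `TEST_INDEX_FILE`, as well as the directory layout, are hypothetical and chosen for illustration; only the `{{TRAIN_INDEX_FILE}}` placeholder form comes from this proposal.

```python
import re
from pathlib import Path

# Hypothetical layout: the real config.py may use different names and paths.
EXAMPLE_ROOT = Path("examples/cifar100")
PATHS = {
    "TRAIN_INDEX_FILE": (EXAMPLE_ROOT / "dataset" / "train_index.txt").as_posix(),
    "TEST_INDEX_FILE": (EXAMPLE_ROOT / "dataset" / "test_index.txt").as_posix(),
}

# Matches {{NAME}}, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Z0-9_]+)\s*\}\}")


def render_template(text, values):
    """Replace every {{NAME}} in text with values[NAME]; fail on unknown names."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"No value configured for placeholder {name!r}")
        return values[name]

    return PLACEHOLDER.sub(substitute, text)


template = (
    'train_index: "{{TRAIN_INDEX_FILE}}"\n'
    'test_index: "{{TEST_INDEX_FILE}}"\n'
)
rendered = render_template(template, PATHS)
print(rendered)
```

Raising on an unknown placeholder, rather than leaving it in the output, surfaces a misconfigured YAML file before the benchmark starts instead of as a "not found" error later.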

2. README.md (Documentation)

  • Provide step-by-step setup instructions for first-time users.
  • Specify Python 3.10+ requirement.
  • Guide users through installation, data preparation, path configuration, and running benchmarks.

3. requirements.txt (Dependency Management)

  • List example-specific dependency versions to ensure reproducibility.

4. basemodel.py (Bug Fix)

  • Correct the prediction loop so that it no longer raises AttributeError: 'list' object has no attribute 'x' at runtime.
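The error message suggests that the prediction code reads an `.x` attribute from an input that is actually a plain list. The sketch below reproduces that failure and shows one defensive way to handle both input shapes. It is an assumption about the cause, not the fix applied in basemodel.py, and `extract_samples` is a hypothetical helper.

```python
from types import SimpleNamespace


def extract_samples(data):
    """Return the samples whether data is a dataset-like object or a plain list.

    Hypothetical helper: assumes dataset-like inputs expose samples as `.x`.
    """
    if hasattr(data, "x"):
        return list(data.x)
    return list(data)


batch = ["img_0.png", "img_1.png"]

# Reproduce the failure: a plain list has no `.x` attribute.
try:
    batch.x
    message = ""
except AttributeError as err:
    message = str(err)
print(message)

# Both input shapes now yield the same list of samples.
wrapped = SimpleNamespace(x=batch)
print(extract_samples(batch))
print(extract_samples(wrapped))
```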

5. testenv.yaml (Bug Fix)

  • Replace train_url and test_url with train_index and test_index, the keys expected by core/testenvmanager/dataset/dataset.py.
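As an illustration of the rename, the sketch below migrates the legacy keys in an already-parsed dataset section. The helper `migrate_dataset_keys` and the sample paths are hypothetical; only the four key names come from this proposal.

```python
LEGACY_TO_CURRENT = {"train_url": "train_index", "test_url": "test_index"}


def migrate_dataset_keys(dataset_cfg):
    """Return a copy of dataset_cfg with legacy keys renamed.

    Raises ValueError if a legacy key and its replacement are both present,
    since it would be unclear which value should win.
    """
    migrated = {}
    for key, value in dataset_cfg.items():
        new_key = LEGACY_TO_CURRENT.get(key, key)
        if new_key in migrated:
            raise ValueError(f"Both legacy and current forms of {new_key!r} are set")
        migrated[new_key] = value
    return migrated


# Hypothetical dataset section as it might look after parsing the YAML file.
old_cfg = {"train_url": "dataset/train_index.txt", "test_url": "dataset/test_index.txt"}
new_cfg = migrate_dataset_keys(old_cfg)
print(new_cfg)
```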

Expected Outcome

  • Runnable: Example runs without crashes.
  • Portable: Works across machines without hardcoded paths.
  • Documented: Easy for new users to set up and explore.
  • Reproducible: Environment and dependencies fully specified.

Note: A preliminary implementation of this proposal is available as open PR #251, demonstrating a fully functional proof of concept.
