Proposal for LFX Term-3 at KubeEdge
Comprehensive Example Restoration Proposal for KubeEdge Ianvs
Issue: Proposal to fix configuration and runtime failures in CIFAR100-Federated federated_learning/fedavg example
Parent Issue: #230
By: Zishan Ahmad
Mentors: Zimu Zheng, Shijing Hu
Background
The CIFAR100-Federated example is a compact framework on KubeEdge Ianvs, featuring client-side training and server-side FedAvg aggregation on CIFAR-100. It serves as a baseline for onboarding, benchmarking, and prototyping edge ML workflows. However, the example is currently broken and cannot be run due to multiple issues, each causing failures either during configuration or at runtime.
Critical Issues
- Dependency Conflict: TensorFlow is used throughout the code but is not listed in any `requirements.txt` file. Additionally, the Ianvs documentation suggests using Python 3.8, but running the example in common environments leads to TensorFlow version dependency errors. Even older versions, such as TensorFlow 2.10.0, trigger protobuf dependency issues.
- Non-Portable Configuration: Hardcoded absolute paths result in "not found" or "permission denied" errors while running `utils.py` or the Ianvs examples. Even after resolving the path issues, the example still fails to run. Debugging screenshots are attached to the issue.
- Incorrect YAML Keys: The YAML file incorrectly uses the keys `train_url` and `test_url`. The correct keys for passing the prepared dataset are `train_index` and `test_index`, as shown in `core/testenvmanager/dataset/dataset.py`.
- Runtime Code Bug: Even after fixing the environment and paths, the prediction loop in `basemodel.py` triggers a fatal runtime error:
  `AttributeError: 'list' object has no attribute 'x'`
- Missing Documentation: No `README.md` exists to guide setup, dependencies, or the correct workflow, leaving users to guess and run into avoidable friction.
Impact
These issues make the CIFAR100-Federated example effectively unusable in its current state. They hinder onboarding, slow prototyping, and create unnecessary friction for users attempting to benchmark edge ML workflows. Resolving them would restore the example as a reliable baseline for testing, experimentation, and learning in KubeEdge Ianvs environments.
Goals
This proposal aims to restore the federated_learning/fedavg example to a fully functional, portable, and documented state. Key deliverables:
- Introduce a centralized configuration pattern for the example using a `config.py` file to eliminate hardcoded paths and serve as a template for other examples.
- Change `train_url` and `test_url` to `train_index` and `test_index`, respectively.
- Fix a runtime bug in `basemodel.py`.
- Add a comprehensive `README.md` with setup instructions.
- Include a `requirements.txt` file to formalize dependencies.
Scope
- Users: New community members, researchers, and developers exploring KubeEdge Ianvs, especially in the context of federated learning.
- Scope: Limited to the `examples/cifar100/` directory and its dependency changes. No core-related changes are proposed.
- Uniqueness:
  - Focus Area: Focuses on the CIFAR-100 federated learning example, an area that has not received recent maintenance and is critical for new users interested in FL on KubeEdge.
  - Pipeline Approach: Identifies failure points across the full pipeline, including documentation, environment, configuration, and code logic, rather than addressing a single isolated issue.
Detailed Design
Architecture
The architecture will remain within the example itself; no core Ianvs modules will be modified. Proposed components:
- Centralized Configuration Manager: Introduce `examples/cifar100/config.py` to manage configuration centrally.
- YAML Placeholders: Add placeholders in YAML files that can be controlled via the config file to eliminate hardcoded values and improve maintainability. This ensures consistency and makes the example easier to verify via CI/CD.
- Bug Fixes: Address the identified issues across YAML and Python files.
- Documentation and Dependencies: Provide a `README.md` and `requirements.txt` to formalize setup instructions and dependencies.
Module Details
1. config.py (Centralized Path Configuration)
- Introduce a central configuration for input/output/model paths with both absolute and relative options.
- Dynamically replace placeholders in `.yaml` files (e.g., `{{TRAIN_INDEX_FILE}}`) with actual paths to eliminate hardcoded values and improve maintainability.
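The placeholder replacement described above could be sketched roughly as follows. This is a minimal illustration, not the contents of the proposed `config.py`: the helper names (`build_paths`, `render_template`), the `TEST_INDEX_FILE` placeholder, the example root, and the directory layout are all assumptions made for the example. Only the `{{TRAIN_INDEX_FILE}}` placeholder style comes from the proposal.

```python
import re
from pathlib import Path

# Matches {{NAME}} placeholders such as {{TRAIN_INDEX_FILE}}.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Z0-9_]+)\s*\}\}")


def build_paths(root):
    """Derive example paths from one root directory (hypothetical layout)."""
    root = Path(root)
    return {
        "TRAIN_INDEX_FILE": str(root / "data" / "train_index.txt"),
        "TEST_INDEX_FILE": str(root / "data" / "test_index.txt"),
    }


def render_template(text, values):
    """Replace every {{NAME}} in text, failing loudly on unknown names."""

    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value configured for placeholder {name}")
        return values[name]

    return PLACEHOLDER.sub(substitute, text)


# Hypothetical YAML fragment and root path, used only to exercise the helpers.
template = (
    'train_index: "{{TRAIN_INDEX_FILE}}"\n'
    'test_index: "{{TEST_INDEX_FILE}}"\n'
)
rendered = render_template(template, build_paths("examples/cifar100"))
print(rendered)
```

Raising on an unknown placeholder, rather than leaving it in place, means a typo in a YAML template surfaces at render time instead of as a confusing "file not found" error later in the benchmark run.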
2. README.md (Documentation)
- Provide step-by-step setup instructions for first-time users.
- Specify the Python 3.10+ requirement.
- Guide users through installation, data preparation, path configuration, and running benchmarks.
3. requirements.txt (Dependency Management)
- List example-specific dependency versions to ensure reproducibility.
4. basemodel.py (Bug Fix)
- Correct the prediction loop to prevent runtime errors (e.g., `AttributeError`).
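The error `'list' object has no attribute 'x'` suggests that the prediction loop reads `.x` from an argument that is sometimes a plain list. The proposal does not show the code in `basemodel.py`, so the following is only a sketch of one possible guard. It assumes that the input is either a dataset-like object exposing `.x` or already a list of samples. The names `extract_samples` and `predict_all`, and the stand-in `predict_one` callable, are hypothetical.

```python
from types import SimpleNamespace


def extract_samples(data):
    """Return the samples to iterate over.

    Accepts either a dataset-like object with an `x` attribute or a
    plain iterable of samples (assumed input shapes, see lead-in).
    """
    if hasattr(data, "x"):
        return list(data.x)
    return list(data)


def predict_all(data, predict_one):
    """Apply predict_one to each sample, regardless of input shape."""
    return [predict_one(sample) for sample in extract_samples(data)]


# Both input shapes go through the same loop; len() stands in for a model.
as_list = ["a.png", "b.png"]
as_dataset = SimpleNamespace(x=["a.png", "b.png"], y=[0, 1])
list_result = predict_all(as_list, len)
dataset_result = predict_all(as_dataset, len)
print(list_result == dataset_result)
```

Normalizing the input once, before the loop, keeps the loop body free of type checks.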
5. testenv.yaml (Bug Fix)
- Replace `train_url` and `test_url` with `train_index` and `test_index` to standardize paths.
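The key rename can be illustrated on an already-parsed configuration mapping. This sketch assumes the dataset section of `testenv.yaml` loads as a flat dictionary, and it skips YAML parsing to stay dependency-free. The helper `migrate_dataset_keys` and the example paths are hypothetical; only the four key names come from the proposal.

```python
# Legacy key -> key expected by the dataset loader, per the proposal.
LEGACY_KEYS = {"train_url": "train_index", "test_url": "test_index"}


def migrate_dataset_keys(dataset_cfg):
    """Return a copy of dataset_cfg with legacy keys renamed.

    Raises ValueError if a legacy key and its replacement are both
    present, since it is unclear which value should win.
    """
    migrated = {}
    for key, value in dataset_cfg.items():
        new_key = LEGACY_KEYS.get(key, key)
        if new_key in migrated:
            raise ValueError(f"both legacy and new key given for {new_key}")
        migrated[new_key] = value
    return migrated


# Placeholder paths for illustration only.
old_cfg = {
    "train_url": "/path/to/train_index.txt",
    "test_url": "/path/to/test_index.txt",
}
new_cfg = migrate_dataset_keys(old_cfg)
print(sorted(new_cfg))
```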
Expected Outcome
- Runnable: Example runs without crashes.
- Portable: Works across machines without hardcoded paths.
- Documented: Easy for new users to set up and explore.
- Reproducible: Environment and dependencies fully specified.
Note: A preliminary implementation of this proposed work is available as open PR #251, demonstrating a fully functional proof of concept.