HiMix: Hierarchical Visual-Textual Mixing Network for Lesion Segmentation

(* : Equal Contribution)

POSTECH

WACV 2026

:us: Tucson, Arizona

Illustration of the overall framework (HiMix).


Abstract

Lesion segmentation is an essential task in medical imaging to support diagnosis and assessment of pathologies. While deep learning models have shown success in various domains, their reliance on large-scale annotated datasets limits applicability in the medical domain due to labeling cost. To address this issue, recent studies in medical image segmentation have utilized clinical texts as complementary semantic cues without additional annotations. However, most existing methods utilize a single textual embedding and fail to capture hierarchical interactions between language and visual features, which limits their ability to leverage fine-grained cues essential for precise and detailed segmentation. To this end, we propose the Hierarchical Visual-Textual Mixing Network (HiMix), a novel multi-modal segmentation framework that mixes multi-scale image and text representations throughout the mask decoding process. HiMix progressively injects hierarchical text embeddings, from high-level semantics to fine-grained spatial details, into corresponding image decoder layers to bridge the modality gap and enhance visual feature refinement at multiple levels of abstraction. Experiments on the QaTa-COV19, MosMedData+ and Kvasir-SEG datasets demonstrate that HiMix consistently outperforms uni-modal and multi-modal methods. Furthermore, HiMix exhibits strong generalization to unstructured textual formats, highlighting its practical applicability in real-world clinical scenarios.
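To make the decoding process concrete, below is a minimal PyTorch sketch of the hierarchical mixing idea: each decoder stage upsamples the visual features and mixes in one level of text embedding via cross-attention. The module names, dimensions, and the choice of cross-attention as the mixing operator are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of hierarchical visual-textual mixing (illustrative
# assumptions throughout; not the authors' released implementation).
import torch
import torch.nn as nn

class TextMixingBlock(nn.Module):
    """Upsamples visual features and mixes in one level of text embedding."""
    def __init__(self, vis_dim, txt_dim, num_heads=4):
        super().__init__()
        self.up = nn.ConvTranspose2d(vis_dim, vis_dim // 2, kernel_size=2, stride=2)
        self.txt_proj = nn.Linear(txt_dim, vis_dim // 2)
        self.attn = nn.MultiheadAttention(vis_dim // 2, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim // 2)

    def forward(self, x, txt):
        # x: (B, C, H, W) visual features; txt: (B, T, txt_dim) one text level
        x = self.up(x)                                # spatial upsampling (x2)
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)              # (B, HW, C) pixel queries
        kv = self.txt_proj(txt)                       # (B, T, C) text keys/values
        mixed, _ = self.attn(q, kv, kv)               # text-conditioned refinement
        q = self.norm(q + mixed)
        return q.transpose(1, 2).reshape(b, c, h, w)

class HierarchicalMixingDecoder(nn.Module):
    """Injects one text level per decoder stage, coarse-to-fine."""
    def __init__(self, dims=(256, 128, 64), txt_dim=768):
        super().__init__()
        self.blocks = nn.ModuleList(TextMixingBlock(d, txt_dim) for d in dims)
        self.head = nn.Conv2d(dims[-1] // 2, 1, kernel_size=1)

    def forward(self, x, txt_levels):
        # txt_levels: list of (B, T, txt_dim) tensors, one per decoder stage,
        # ordered from high-level semantics to fine-grained details
        for block, txt in zip(self.blocks, txt_levels):
            x = block(x, txt)
        return self.head(x)                           # (B, 1, H, W) lesion logits
```

In this sketch the coarsest text level conditions the deepest decoder stage and progressively finer levels condition shallower stages, mirroring the coarse-to-fine injection described in the abstract.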

Text Feature Utilization Strategy

Figure: Illustration of the text feature utilization strategy in our proposed approach. Multi-level linguistic features extracted from the text encoder are hierarchically aligned and passed to the corresponding decoder layers.
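As a rough illustration of how multi-level linguistic features can be obtained, the snippet below pulls hidden states from several depths of a frozen BERT encoder and treats them as per-stage text levels. The choice of BERT and the specific layer indices (4, 8, 12) are assumptions; the paper's text encoder and alignment scheme may differ.

```python
# Sketch of extracting multi-level linguistic features from a frozen text
# encoder (BERT and the layer indices are assumptions for illustration).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

report = "Bilateral ground-glass opacities in the lower lobes."
tokens = tokenizer(report, return_tensors="pt")

with torch.no_grad():
    out = encoder(**tokens, output_hidden_states=True)

# hidden_states is a tuple of (B, T, 768) tensors: the embedding layer plus
# one entry per transformer layer. Shallow layers carry more surface-level
# cues, deep layers more abstract semantics; pick one per decoder stage.
txt_levels = [out.hidden_states[i] for i in (4, 8, 12)]
```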


Key Modules in HiMix

Figure: Illustration of key modules in HiMix. Left: Decoder with adaptive spectrum refinement module (ASRM). Right: Dynamic layer fusion module (DLFM).
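The exact formulations of ASRM and DLFM are given in the paper; the sketch below shows only one plausible reading of the module names, and every design choice in it (frequency-domain gating for ASRM, softmax-weighted layer fusion for DLFM) is an assumption for illustration.

```python
# Speculative readings of the two module names (not taken from the paper):
# ASRM as a learnable gate over spatial-frequency components, DLFM as a
# softmax-weighted fusion of same-shape layer outputs.
import torch
import torch.nn as nn

class ASRM(nn.Module):
    """Adaptive spectrum refinement: per-frequency gating via rFFT."""
    def __init__(self, channels, h, w):
        super().__init__()
        # One real-valued gate per channel and frequency bin (rfft2 layout).
        self.gate = nn.Parameter(torch.ones(channels, h, w // 2 + 1))

    def forward(self, x):                             # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")       # complex spectrum
        spec = spec * self.gate                       # emphasize/suppress bands
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")

class DLFM(nn.Module):
    """Dynamic layer fusion: learned softmax weights over L layer outputs."""
    def __init__(self, num_layers):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layers):                        # list of L same-shape tensors
        weights = torch.softmax(self.logits, dim=0)
        return sum(w * layer for w, layer in zip(weights, layers))
```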


Quantitative Results

Table: Quantitative comparison of segmentation performance for uni-modal (top) and multi-modal (middle) learning baselines, and HiMix (bottom). The best and second-best results are highlighted in bold and underlined, respectively.


Qualitative Results

Figure: Visualization of segmentation results on the QaTa-COV19 (top), MosMedData+ (middle) and Kvasir-SEG (bottom) datasets. Yellow, red, and green represent true positives, false negatives, and false positives, respectively.


Conclusion

In this work, we proposed HiMix, a novel multi-modal segmentation framework that effectively aligns and leverages hierarchical representations from both image and text modalities. HiMix dynamically extracts and refines essential information from both modalities to ensure a hierarchical integration of high-level semantics and fine-grained details. Experiments on diverse medical segmentation benchmarks demonstrate that HiMix consistently outperforms state-of-the-art models, validating the advantage of its hierarchical design. Moreover, HiMix demonstrates strong adaptability to diverse text formats, showcasing its practical applicability in real-world scenarios.


BibTeX

@inproceedings{hwang2026himix,
  title={HiMix: Hierarchical Visual-Textual Mixing Network for Lesion Segmentation},
  author={Hwang, Soojin and Sim, Jaeyoon and Kim, Won Hwa},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2026}
}