Recent advances in multilayer perceptron (MLP) models have provided new and effective network architecture designs for computer vision tasks. Compared with convolutional neural networks (CNNs) and vision transformers, MLP-based visual backbones carry less inductive bias, which can improve sample efficiency and reduce computational cost. We therefore designed the Mixer MLP Architecture for Image Matching (MAIM), a coarse-to-fine, detector-free image-matching scheme. At its core, we constructed a mixer MLP architecture called Mixer-WMLP, which evenly divides the feature map into non-overlapping windows, flattens each window into a token, and exchanges token information across spatial locations and feature channels through two-layer MLP structures in the coarse-level model; the coarse matches are then refined by dense fine-level matching to produce the final matches. The resulting global field-of-view mixer MLP framework for image matching incurs a low computational cost. In experiments on indoor and outdoor relative pose estimation, our MLP architecture is compared with CNN- and transformer-based image-matching methods. Our method has significant advantages in real-time performance and greatly reduces computational cost, demonstrating its effectiveness in image-matching tasks.
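To make the window-token mixing concrete, below is a minimal PyTorch sketch of a window-based mixer block in the spirit of the Mixer-WMLP description above. The class name `WindowMixerBlock`, the layer widths, and the token-mixing/channel-mixing split (borrowed from the MLP-Mixer design) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a window-based mixer-MLP block; names and
# dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class WindowMixerBlock(nn.Module):
    """Splits a feature map into non-overlapping windows, flattens each
    window into a sequence of tokens, and mixes information with two
    two-layer MLPs: one across spatial positions, one across channels."""

    def __init__(self, dim: int, window_size: int, hidden_ratio: int = 2):
        super().__init__()
        self.window_size = window_size
        tokens = window_size * window_size  # tokens per window
        self.norm1 = nn.LayerNorm(dim)
        # Token-mixing MLP: acts over the spatial positions in a window.
        self.token_mlp = nn.Sequential(
            nn.Linear(tokens, tokens * hidden_ratio),
            nn.GELU(),
            nn.Linear(tokens * hidden_ratio, tokens),
        )
        self.norm2 = nn.LayerNorm(dim)
        # Channel-mixing MLP: acts over the feature channels.
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, dim * hidden_ratio),
            nn.GELU(),
            nn.Linear(dim * hidden_ratio, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C), with H and W divisible by the window size.
        B, H, W, C = x.shape
        ws = self.window_size
        # Partition into non-overlapping windows and flatten each window
        # into ws*ws tokens: (B * num_windows, ws*ws, C).
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        # Token mixing across spatial positions (transpose so the MLP
        # acts on the token dimension), with a residual connection.
        y = self.norm1(x).transpose(1, 2)          # (BN, C, ws*ws)
        x = x + self.token_mlp(y).transpose(1, 2)  # back to (BN, ws*ws, C)
        # Channel mixing across features, also residual.
        x = x + self.channel_mlp(self.norm2(x))
        # Restore the (B, H, W, C) feature-map layout.
        x = x.view(B, H // ws, W // ws, ws, ws, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)


if __name__ == "__main__":
    block = WindowMixerBlock(dim=64, window_size=8)
    feats = torch.randn(2, 32, 32, 64)  # toy coarse-level feature map
    print(block(feats).shape)           # torch.Size([2, 32, 32, 64])
```

Because every mixing operation here is a plain linear layer over a fixed-size window, the block avoids both the quadratic attention cost of transformers and large convolutional kernels, which is consistent with the low computational cost claimed for the framework.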