Springer Science and Business Media LLC

Featured scientific publications

* Data is for reference only

Self-attention Guidance Based Crowd Localization and Counting
Springer Science and Business Media LLC - 2024
Zhouzhou Ma, Guanghua Gu, Wenrui Zhao
Most existing studies on crowd analysis are limited to counting, which cannot provide the exact location of individuals. This paper proposes a self-attention guidance based crowd localization and counting network (SA-CLCN), which can simultaneously locate and count crowds. We take the form of object detection, using the original point annotations of crowd datasets as supervision to train the network. Ultimately, the center point coordinate of each head as well as the crowd count are predicted. Specifically, to cope with the spatial and positional variations of the crowd, the proposed method introduces a transformer to construct a global-local feature extractor (GLFE) together with the convolutional structure. It establishes the near-to-far dependency between elements so that the global context and local detail features of the crowd image can be extracted simultaneously. Then, this paper designs a pyramid feature fusion module (PFFM) to fuse the global and local information from high level to low level to obtain a multiscale feature representation. In downstream tasks, this paper predicts candidate point offsets and confidence scores with a simple regression head and classification head. In addition, the Hungarian algorithm is used to match the predicted point set and the labelled point set to facilitate the calculation of losses. The proposed network avoids the errors or higher costs associated with using traditional density maps or bounding box annotations. Importantly, we have conducted extensive experiments on several crowd datasets, and the proposed method has produced competitive results in both counting and localization.
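The matching step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the matching cost is plain Euclidean distance between points (the paper's actual cost is not given here and may also involve confidence scores), and it uses a brute-force search as a stand-in for the Hungarian algorithm: for tiny inputs it finds the same minimum-cost one-to-one assignment. The function name `match_points` and the sample coordinates are hypothetical.

```python
from itertools import permutations
from math import hypot


def match_points(predicted, labelled):
    """Return (pairs, total_cost) for the minimum-cost one-to-one matching.

    Assumes len(predicted) >= len(labelled); every labelled point is paired
    with exactly one predicted point. Brute force, so only for tiny inputs;
    the Hungarian algorithm solves the same problem in polynomial time.
    """
    best_pairs, best_cost = [], float("inf")
    # Try every way of assigning distinct predicted points to labelled points.
    for chosen in permutations(range(len(predicted)), len(labelled)):
        cost = sum(
            hypot(predicted[p][0] - labelled[g][0], predicted[p][1] - labelled[g][1])
            for g, p in enumerate(chosen)
        )
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, g) for g, p in enumerate(chosen)]
    return best_pairs, best_cost


# Hypothetical candidate points and ground-truth head centers.
predicted = [(10.0, 10.0), (52.0, 48.0), (31.0, 29.0)]
labelled = [(30.0, 30.0), (50.0, 50.0)]
pairs, cost = match_points(predicted, labelled)
print(pairs, round(cost, 3))
```

Each pair is `(predicted_index, labelled_index)`; unmatched predictions (index 0 here) would typically be treated as background when computing a loss.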
Robust Local Light Field Synthesis via Occlusion-aware Sampling and Deep Visual Feature Fusion
Springer Science and Business Media LLC - 2023
Wenpeng Xing, Jie Chen, Yike Guo
Novel view synthesis has attracted tremendous research attention recently for its applications in virtual reality and immersive telepresence. Rendering a locally immersive light field (LF) based on arbitrary large baseline RGB references is a challenging problem that lacks efficient solutions with existing novel view synthesis techniques. In this work, we aim at truthfully rendering local immersive novel views/LF images based on large baseline LF captures and a single RGB image in the target view. To fully explore the precious information from source LF captures, we propose a novel occlusion-aware source sampler (OSS) module which efficiently transfers the pixels of source views to the target view’s frustum in an occlusion-aware manner. An attention-based deep visual fusion module is proposed to fuse the revealed occluded background content with a preliminary LF into a final refined LF. The proposed source sampling and fusion mechanism not only helps to provide information for occluded regions from varying observation angles, but also proves to be able to effectively enhance the visual rendering quality. Experimental results show that our proposed method is able to render high-quality LF images/novel views with sparse RGB references and outperforms state-of-the-art LF rendering and novel view synthesis methods.

Multimodal Pretraining from Monolingual to Multilingual
Springer Science and Business Media LLC - Volume 20 - Pages 220-232 - 2023
Liang Zhang, Ludan Ruan, Anwen Hu, Qin Jin
Multimodal pretraining has made convincing achievements in various downstream tasks in recent years. However, since the majority of the existing works construct models based on English, their applications are limited by language. In this work, we address this issue by developing models with multimodal and multilingual capabilities. We explore two types of methods to extend multimodal pretraining models from monolingual to multilingual. Specifically, we propose a pretraining-based model named multilingual multimodal pretraining (MLMM), and two generalization-based models named multilingual CLIP (M-CLIP) and multilingual acquisition (MLA). In addition, we further extend the generalization-based models to incorporate the audio modality and develop the multilingual CLIP for vision, language, and audio (CLIP4VLA). Our models achieve state-of-the-art performance on multilingual vision-text retrieval, visual question answering, and image captioning benchmarks. Based on the experimental results, we discuss the pros and cons of the two types of models and their potential practical applications.
Stability and Generalization of Hypergraph Collaborative Networks
Springer Science and Business Media LLC - Volume 21 - Pages 184-196 - 2024
Michael K. Ng, Hanrui Wu, Andy Yip
Graph neural networks have been shown to be very effective in utilizing pairwise relationships across samples. Recently, there have been several successful proposals to generalize graph neural networks to hypergraph neural networks to exploit more complex relationships. In particular, the hypergraph collaborative networks yield superior results compared to other hypergraph neural networks for various semi-supervised learning tasks. The collaborative network can provide high quality vertex embeddings and hyperedge embeddings together by formulating them as a joint optimization problem and by using their consistency in reconstructing the given hypergraph. In this paper, we aim to establish the algorithmic stability of the core layer of the collaborative network and provide generalization guarantees. The analysis sheds light on the design of hypergraph filters in collaborative networks, for instance, how the data and hypergraph filters should be scaled to achieve uniform stability of the learning process. Some experimental results on real-world datasets are presented to illustrate the theory.
An Empirical Study on Google Research Football Multi-agent Scenarios
Springer Science and Business Media LLC - Pages 1-22 - 2024
Yan Song, He Jiang, Zheng Tian, Haifeng Zhang, Yingping Zhang, Jiangcheng Zhu, Zonghong Dai, Weinan Zhang, Jun Wang
Few multi-agent reinforcement learning (MARL) studies on Google Research Football (GRF) focus on the 11-vs-11 multi-agent full-game scenario, and to the best of our knowledge, no open benchmark on this scenario has been released to the public. In this work, we fill the gap by providing a population-based MARL training pipeline and hyperparameter settings on the multi-agent football scenario that outperforms the bot with difficulty 1.0 from scratch within 2 million steps. Our experiments serve as a reference for the expected performance of independent proximal policy optimization (IPPO), a state-of-the-art multi-agent reinforcement learning algorithm in which each agent optimizes its own policy independently, across various training configurations. Meanwhile, we release our training framework Light-MALib, which extends the MALib codebase with a distributed and asynchronous implementation and additional analytical tools for football games. Finally, we provide guidance for building strong football AI with population-based training and release diverse pretrained policies for benchmarking. The goal is to provide the community with a head start for anyone experimenting on GRF and a simple-to-use population-based training framework for further improving their agents through self-play. The implementation is available at https://github.com/Shanghai-Digital-Brain-Laboratory/DB-Football .
Sharing Weights in Shallow Layers via Rotation Group Equivariant Convolutions
Springer Science and Business Media LLC - Volume 19 - Pages 115-126 - 2022
Zhiqiang Chen, Ting-Bing Xu, Jinpeng Li, Huiguang He
The convolution operation has the property of translation group equivariance. To achieve more group equivariances, rotation group equivariant convolutions (RGEC) have been proposed to obtain both translation and rotation group equivariance. However, previous works focus mostly on the number of parameters and usually ignore other resource costs. In this paper, we construct our networks without introducing extra resource costs. Specifically, a convolution filter is rotated to different orientations to extract features for multiple channels. Meanwhile, we use far fewer filters than previous works to ensure that the number of output channels does not increase. To further enhance the orthogonality of filters in different orientations, we construct a non-maximum-suppression loss on the rotation dimension to suppress all orientations except the most activated one. Considering that low-level features benefit more from rotational symmetry, we only share weights in the shallow layers (SWSL) via RGEC. Extensive experiments on multiple datasets (e.g., ImageNet, CIFAR, and MNIST) show that SWSL can effectively benefit from the higher-degree weight sharing and improve the performance of various networks, including plain and ResNet architectures. Meanwhile, the convolutional filters and parameters are far fewer (e.g., 75% and 87.5% fewer) in the shallow layers, and no extra computation costs are introduced.
#RGEC #weight sharing #orthogonality #deep neural networks #rotation group convolution
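The idea of rotating one filter to several orientations, so that a single set of weights serves multiple channels, can be sketched as follows. This is only an illustration under the assumption of four 90-degree rotations on a square kernel; the abstract does not state which rotation angles the paper uses. The helper names `rotate90` and `rotated_copies` are hypothetical.

```python
def rotate90(kernel):
    """Rotate a square kernel (a list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*kernel[::-1])]


def rotated_copies(kernel, n=4):
    """Return n kernels: the original followed by successive 90-degree rotations.

    All copies share the same underlying weights, so the parameter count is
    that of a single kernel even though n orientations are produced.
    """
    copies = [kernel]
    for _ in range(n - 1):
        copies.append(rotate90(copies[-1]))
    return copies


# A hypothetical 3x3 kernel with distinct entries so rotations are easy to see.
base = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
bank = rotated_copies(base)
for k in bank:
    print(k)
```

With four orientations derived from one kernel, the stored weights are one quarter of what four independent kernels would need, which is consistent with the 75% reduction quoted in the abstract.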
Machine Learning for Cataract Classification/Grading on Ophthalmic Imaging Modalities: A Survey
Springer Science and Business Media LLC - Volume 19 - Pages 184-208 - 2022
Xiao-Qing Zhang, Yan Hu, Zun-Jie Xiao, Jian-Sheng Fang, Risa Higashita, Jiang Liu
Cataracts are the leading cause of visual impairment and blindness globally. Over the years, researchers have achieved significant progress in developing state-of-the-art machine learning techniques for automatic cataract classification and grading, aiming to prevent cataracts early and improve clinicians’ diagnosis efficiency. This paper provides a comprehensive survey of recent advances in machine learning techniques for cataract classification/grading based on ophthalmic images. We summarize existing literature from two research directions: conventional machine learning methods and deep learning methods. This survey also provides insights into both the merits and limitations of existing works. In addition, we discuss several challenges of automatic cataract classification/grading based on machine learning techniques and present possible solutions to these challenges for future research.
ECG Biometrics via Enhanced Correlation and Semantic-rich Embedding
Springer Science and Business Media LLC - Volume 20 - Pages 697-706 - 2023
Kui-Kui Wang, Gong-Ping Yang, Lu Yang, Yu-Wen Huang, Yi-Long Yin
Electrocardiogram (ECG) biometric recognition has gained considerable attention, and various methods have been proposed to facilitate its development. However, one limitation is that the diversity of ECG signals affects the recognition performance. To address this issue, in this paper, we propose a novel ECG biometrics framework based on enhanced correlation and semantic-rich embedding. Firstly, we construct an enhanced correlation between the base feature and latent representation by using only one projection. Secondly, to fully exploit the semantic information, we take both the label and pairwise similarity into consideration to reduce the influence of ECG sample diversity. Furthermore, to solve the objective function, we propose an effective and efficient algorithm for optimization. Finally, extensive experiments are conducted on two benchmark datasets, and the experimental results show the effectiveness of our framework.
Practical Blind Image Denoising via Swin-Conv-UNet and Data Synthesis
Springer Science and Business Media LLC - Volume 20 - Pages 822-836 - 2023
Kai Zhang, Yawei Li, Jingyun Liang, Jiezhang Cao, Yulun Zhang, Hao Tang, Deng-Ping Fan, Radu Timofte, Luc Van Gool
While recent years have witnessed a dramatic upsurge in exploiting deep neural networks for image denoising, existing methods mostly rely on simple noise assumptions, such as additive white Gaussian noise (AWGN), JPEG compression noise and camera sensor noise, and a general-purpose blind denoising method for real images remains unsolved. In this paper, we attempt to solve this problem from the perspective of network architecture design and training data synthesis. Specifically, for the network architecture design, we propose a swin-conv block to incorporate the local modeling ability of the residual convolutional layer and the non-local modeling ability of the swin transformer block, and then plug it as the main building block into the widely-used image-to-image translation UNet architecture. For the training data synthesis, we design a practical noise degradation model which takes into consideration different kinds of noise (including Gaussian, Poisson, speckle, JPEG compression, and processed camera sensor noise) and resizing, and also involves a random shuffle strategy and a double degradation strategy. Extensive experiments on AWGN removal and real image denoising demonstrate that the new network architecture design achieves state-of-the-art performance and the new degradation model can help to significantly improve the practicability. We believe our work can provide useful insights into current denoising research. The source code is available at https://github.com/cszn/SCUNet .
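The random shuffle and double degradation strategies described in this abstract can be sketched in a few lines. This is a simplified illustration, not the SCUNet implementation: it assumes a flat list of pixel intensities in [0, 1], covers only Gaussian and speckle noise, and interprets "double degradation" as running the shuffled sequence twice. All function names and the `sigma` values are hypothetical.

```python
import random


def gaussian_noise(pixels, rng, sigma=0.05):
    """Additive noise: each pixel gets an independent zero-mean Gaussian offset."""
    return [p + rng.gauss(0.0, sigma) for p in pixels]


def speckle_noise(pixels, rng, sigma=0.05):
    """Multiplicative noise: the offset scales with the pixel's own intensity."""
    return [p + p * rng.gauss(0.0, sigma) for p in pixels]


def degrade(pixels, rng, passes=2):
    """Apply the noise steps in a random order, repeated `passes` times."""
    steps = [gaussian_noise, speckle_noise]
    for _ in range(passes):
        rng.shuffle(steps)  # random shuffle strategy: order differs per pass
        for step in steps:
            pixels = step(pixels, rng)
    # Clip back to the valid intensity range.
    return [min(1.0, max(0.0, p)) for p in pixels]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
clean = [0.0, 0.25, 0.5, 0.75, 1.0]
noisy = degrade(clean, rng)
print(noisy)
```

Pairs of `clean` and `noisy` produced this way would serve as training samples; shuffling the order widens the range of noise combinations the network sees.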
Towards Jumping Skill Learning by Target-guided Policy Optimization for Quadruped Robots
Springer Science and Business Media LLC - 2024
Chi Zhang, Wei Zou, Ningbo Cheng, Shuomo Zhang
Endowing quadruped robots with the skill of forward jumping helps them overcome barriers and pass through complex terrains. In this paper, a model-free control architecture with target-guided policy optimization and deep reinforcement learning (DRL) for quadruped robot jumping is presented. First, the jumping phase is divided into take-off and flight-landing phases, and optimal strategies with soft actor-critic (SAC) are constructed for the two phases respectively. Second, policy learning including expectations, penalties in the overall jumping process, and extrinsic excitations is designed. Corresponding policies and constraints are all provided for successful take-off, excellent flight attitude and stable standing after landing. To avoid the low efficiency of random exploration, a curiosity module is introduced as extrinsic rewards. Additionally, the target-guided module encourages the robot to explore closer and closer to the desired jumping target. Simulation results indicate that the quadruped robot can complete forward jumping locomotion with good horizontal and vertical distances, as well as excellent motion attitudes.
Total: 94