new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Dec 10

PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer

Handwritten Mathematical Expression Recognition (HMER) has wide applications in human-machine interaction scenarios, such as digitized education and automated offices. Recently, sequence-based models with encoder-decoder architectures have been commonly adopted to address this task by directly predicting LaTeX sequences of expression images. However, these methods only implicitly learn the syntax rules provided by LaTeX, which may fail to describe the position and hierarchical relationship between symbols due to complex structural relations and diverse handwriting styles. To overcome this challenge, we propose a position forest transformer (PosFormer) for HMER, which jointly optimizes two tasks: expression recognition and position recognition, to explicitly enable position-aware symbol feature representation learning. Specifically, we first design a position forest that models the mathematical expression as a forest structure and parses the relative position relationships between symbols. Without requiring extra annotations, each symbol is assigned a position identifier in the forest to denote its relative spatial position. Second, we propose an implicit attention correction module to accurately capture attention for HMER in the sequence-based decoder architecture. Extensive experiments validate the superiority of PosFormer, which consistently outperforms the state-of-the-art methods 2.03%/1.22%/2.00%, 1.83%, and 4.62% gains on the single-line CROHME 2014/2016/2019, multi-line M2E, and complex MNE datasets, respectively, with no additional latency or computational cost. Code is available at https://github.com/SJTU-DeepVisionLab/PosFormer.

  • 4 authors
·
Jul 10, 2024

TimberVision: A Multi-Task Dataset and Framework for Log-Component Segmentation and Tracking in Autonomous Forestry Operations

Timber represents an increasingly valuable and versatile resource. However, forestry operations such as harvesting, handling and measuring logs still require substantial human labor in remote environments posing significant safety risks. Progressively automating these tasks has the potential of increasing their efficiency as well as safety, but requires an accurate detection of individual logs as well as live trees and their context. Although initial approaches have been proposed for this challenging application domain, specialized data and algorithms are still too scarce to develop robust solutions. To mitigate this gap, we introduce the TimberVision dataset, consisting of more than 2k annotated RGB images containing a total of 51k trunk components including cut and lateral surfaces, thereby surpassing any existing dataset in this domain in terms of both quantity and detail by a large margin. Based on this data, we conduct a series of ablation experiments for oriented object detection and instance segmentation and evaluate the influence of multiple scene parameters on model performance. We introduce a generic framework to fuse the components detected by our models for both tasks into unified trunk representations. Furthermore, we automatically derive geometric properties and apply multi-object tracking to further enhance robustness. Our detection and tracking approach provides highly descriptive and accurate trunk representations solely from RGB image data, even under challenging environmental conditions. Our solution is suitable for a wide range of application scenarios and can be readily combined with other sensor modalities.

  • 4 authors
·
Jan 13

Automated forest inventory: analysis of high-density airborne LiDAR point clouds with 3D deep learning

Detailed forest inventories are critical for sustainable and flexible management of forest resources, to conserve various ecosystem services. Modern airborne laser scanners deliver high-density point clouds with great potential for fine-scale forest inventory and analysis, but automatically partitioning those point clouds into meaningful entities like individual trees or tree components remains a challenge. The present study aims to fill this gap and introduces a deep learning framework, termed ForAINet, that is able to perform such a segmentation across diverse forest types and geographic regions. From the segmented data, we then derive relevant biophysical parameters of individual trees as well as stands. The system has been tested on FOR-Instance, a dataset of point clouds that have been acquired in five different countries using surveying drones. The segmentation back-end achieves over 85% F-score for individual trees, respectively over 73% mean IoU across five semantic categories: ground, low vegetation, stems, live branches and dead branches. Building on the segmentation results our pipeline then densely calculates biophysical features of each individual tree (height, crown diameter, crown volume, DBH, and location) and properties per stand (digital terrain model and stand density). Especially crown-related features are in most cases retrieved with high accuracy, whereas the estimates for DBH and location are less reliable, due to the airborne scanning setup.

  • 7 authors
·
Dec 22, 2023 1

PrediTree: A Multi-Temporal Sub-meter Dataset of Multi-Spectral Imagery Aligned With Canopy Height Maps

We present PrediTree, the first comprehensive open-source dataset designed for training and evaluating tree height prediction models at sub-meter resolution. This dataset combines very high-resolution (0.5m) LiDAR-derived canopy height maps, spatially aligned with multi-temporal and multi-spectral imagery, across diverse forest ecosystems in France, totaling 3,141,568 images. PrediTree addresses a critical gap in forest monitoring capabilities by enabling the training of deep learning methods that can predict tree growth based on multiple past observations. %Initially focused on French forests, PrediTree is designed as an expanding resource with ongoing efforts to incorporate data from other countries. To make use of this PrediTree dataset, we propose an encoder-decoder framework that requires the multi-temporal multi-spectral imagery and the relative time differences in years between the canopy height map timestamp (target) and each image acquisition date for which this framework predicts the canopy height. The conducted experiments demonstrate that a U-Net architecture trained on the PrediTree dataset provides the highest masked mean squared error of 11.78%, outperforming the next-best architecture, ResNet-50, by around 12%, and cutting the error of the same experiments but on fewer bands (red, green, blue only), by around 30%. This dataset is publicly available on URL{HuggingFace}, and both processing and training codebases are available on URL{GitHub}.

  • 3 authors
·
Sep 1

Eliminating Position Bias of Language Models: A Mechanistic Approach

Position bias has proven to be a prevalent issue of modern language models (LMs), where the models prioritize content based on its position within the given context. This bias often leads to unexpected model failures and hurts performance, robustness, and reliability across various applications. Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of-the-art LMs: causal attention and relative positional encodings. Specifically, we find that causal attention generally causes models to favor distant content, while relative positional encodings like RoPE prefer nearby ones based on the analysis of retrieval-augmented question answering (QA). Further, our empirical study on object detection reveals that position bias is also present in vision-language models (VLMs). Based on the above analyses, we propose to ELIMINATE position bias caused by different input segment orders (e.g., options in LM-as-a-judge, retrieved documents in QA) in a TRAINING-FREE ZERO-SHOT manner. Our method changes the causal attention to bidirectional attention between segments and utilizes model attention values to decide the relative orders of segments instead of using the order provided in input prompts, therefore enabling Position-INvariant inferencE (PINE) at the segment level. By eliminating position bias, models achieve better performance and reliability in downstream tasks where position bias widely exists, such as LM-as-a-judge and retrieval-augmented QA. Notably, PINE is especially useful when adapting LMs for evaluating reasoning pairs: it consistently provides 8 to 10 percentage points performance gains in most cases, and makes Llama-3-70B-Instruct perform even better than GPT-4-0125-preview on the RewardBench reasoning subset.

  • 9 authors
·
Jul 1, 2024 1

SegmentAnyTree: A sensor and platform agnostic deep learning model for tree segmentation using laser scanning data

This research advances individual tree crown (ITC) segmentation in lidar data, using a deep learning model applicable to various laser scanning types: airborne (ULS), terrestrial (TLS), and mobile (MLS). It addresses the challenge of transferability across different data characteristics in 3D forest scene analysis. The study evaluates the model's performance based on platform (ULS, MLS) and data density, testing five scenarios with varying input data, including sparse versions, to gauge adaptability and canopy layer efficacy. The model, based on PointGroup architecture, is a 3D CNN with separate heads for semantic and instance segmentation, validated on diverse point cloud datasets. Results show point cloud sparsification enhances performance, aiding sparse data handling and improving detection in dense forests. The model performs well with >50 points per sq. m densities but less so at 10 points per sq. m due to higher omission rates. It outperforms existing methods (e.g., Point2Tree, TLS2trees) in detection, omission, commission rates, and F1 score, setting new benchmarks on LAUTx, Wytham Woods, and TreeLearn datasets. In conclusion, this study shows the feasibility of a sensor-agnostic model for diverse lidar data, surpassing sensor-specific approaches and setting new standards in tree segmentation, particularly in complex forests. This contributes to future ecological modeling and forest management advancements.

  • 5 authors
·
Jan 28, 2024

TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs

Artificial Intelligence (AI) has made incredible progress recently. On the one hand, advanced foundation models like ChatGPT can offer powerful conversation, in-context learning and code generation abilities on a broad range of open-domain tasks. They can also generate high-level solution outlines for domain-specific tasks based on the common sense knowledge they have acquired. However, they still face difficulties with some specialized tasks because they lack enough domain-specific data during pre-training or they often have errors in their neural network computations on those tasks that need accurate executions. On the other hand, there are also many existing models and systems (symbolic-based or neural-based) that can do some domain-specific tasks very well. However, due to the different implementation or working mechanisms, they are not easily accessible or compatible with foundation models. Therefore, there is a clear and pressing need for a mechanism that can leverage foundation models to propose task solution outlines and then automatically match some of the sub-tasks in the outlines to the off-the-shelf models and systems with special functionalities to complete them. Inspired by this, we introduce TaskMatrix.AI as a new AI ecosystem that connects foundation models with millions of APIs for task completion. Unlike most previous work that aimed to improve a single AI model, TaskMatrix.AI focuses more on using existing foundation models (as a brain-like central system) and APIs of other AI models and systems (as sub-task solvers) to achieve diversified tasks in both digital and physical domains. As a position paper, we will present our vision of how to build such an ecosystem, explain each key component, and use study cases to illustrate both the feasibility of this vision and the main challenges we need to address next.

  • 14 authors
·
Mar 28, 2023

3D Reconstruction and Information Fusion between Dormant and Canopy Seasons in Commercial Orchards Using Deep Learning and Fast GICP

In orchard automation, dense foliage during the canopy season severely occludes tree structures, minimizing visibility to various canopy parts such as trunks and branches, which limits the ability of a machine vision system. However, canopy structure is more open and visible during the dormant season when trees are defoliated. In this work, we present an information fusion framework that integrates multi-seasonal structural data to support robotic and automated crop load management during the entire growing season. The framework combines high-resolution RGB-D imagery from both dormant and canopy periods using YOLOv9-Seg for instance segmentation, Kinect Fusion for 3D reconstruction, and Fast Generalized Iterative Closest Point (Fast GICP) for model alignment. Segmentation outputs from YOLOv9-Seg were used to extract depth-informed masks, which enabled accurate 3D point cloud reconstruction via Kinect Fusion; these reconstructed models from each season were subsequently aligned using Fast GICP to achieve spatially coherent multi-season fusion. The YOLOv9-Seg model, trained on manually annotated images, achieved a mean squared error (MSE) of 0.0047 and segmentation mAP@50 scores up to 0.78 for trunks in dormant season dataset. Kinect Fusion enabled accurate reconstruction of tree geometry, validated with field measurements resulting in root mean square errors (RMSE) of 5.23 mm for trunk diameter, 4.50 mm for branch diameter, and 13.72 mm for branch spacing. Fast GICP achieved precise cross-seasonal registration with a minimum fitness score of 0.00197, allowing integrated, comprehensive tree structure modeling despite heavy occlusions during the growing season. This fused structural representation enables robotic systems to access otherwise obscured architectural information, improving the precision of pruning, thinning, and other automated orchard operations.

  • 6 authors
·
Jul 2

SciPostLayoutTree: A Dataset for Structural Analysis of Scientific Posters

Scientific posters play a vital role in academic communication by presenting ideas through visual summaries. Analyzing reading order and parent-child relations of posters is essential for building structure-aware interfaces that facilitate clear and accurate understanding of research content. Despite their prevalence in academic communication, posters remain underexplored in structural analysis research, which has primarily focused on papers. To address this gap, we constructed SciPostLayoutTree, a dataset of approximately 8,000 posters annotated with reading order and parent-child relations. Compared to an existing structural analysis dataset, SciPostLayoutTree contains more instances of spatially challenging relations, including upward, horizontal, and long-distance relations. As a solution to these challenges, we develop Layout Tree Decoder, which incorporates visual features as well as bounding box features including position and category information. The model also uses beam search to predict relations while capturing sequence-level plausibility. Experimental results demonstrate that our model improves the prediction accuracy for spatially challenging relations and establishes a solid baseline for poster structure analysis. The dataset is publicly available at https://huggingface.co/datasets/omron-sinicx/scipostlayouttree. The code is also publicly available at https://github.com/omron-sinicx/scipostlayouttree.

  • 3 authors
·
Nov 23

TreeSynth: Synthesizing Diverse Data from Scratch via Tree-Guided Subspace Partitioning

Model customization necessitates high-quality and diverse datasets, but acquiring such data remains time-consuming and labor-intensive. Despite the great potential of large language models (LLMs) for data synthesis, current approaches are constrained by limited seed data, model biases, and low-variation prompts, resulting in limited diversity and biased distributions with the increase of data scales. To tackle this challenge, we introduce TREESYNTH, a tree-guided subspace-based data synthesis approach inspired by decision trees. It constructs a spatial partitioning tree to recursively divide a task-specific full data space (i.e., root node) into numerous atomic subspaces (i.e., leaf nodes) with mutually exclusive and exhaustive attributes to ensure both distinctiveness and comprehensiveness before synthesizing samples within each atomic subspace. This globally dividing-and-synthesizing method finally collects subspace samples into a comprehensive dataset, effectively circumventing repetition and space collapse to ensure the diversity of large-scale data synthesis. Furthermore, the spatial partitioning tree enables sample allocation into atomic subspaces, allowing the rebalancing of existing datasets for more balanced and comprehensive distributions. Empirically, extensive experiments across diverse benchmarks consistently demonstrate the superior data diversity, model performance, and robust scalability of TREESYNTH compared to both human-crafted datasets and peer data synthesis methods, with an average performance gain reaching 10%. Besides, the consistent improvements of TREESYNTH-balanced datasets highlight its efficacious application to redistribute existing datasets for more comprehensive coverage and the induced performance enhancement. The code is available at https://github.com/cpa2001/TreeSynth.

OAM-TCD: A globally diverse dataset of high-resolution tree cover maps

Accurately quantifying tree cover is an important metric for ecosystem monitoring and for assessing progress in restored sites. Recent works have shown that deep learning-based segmentation algorithms are capable of accurately mapping trees at country and continental scales using high-resolution aerial and satellite imagery. Mapping at high (ideally sub-meter) resolution is necessary to identify individual trees, however there are few open-access datasets containing instance level annotations and those that exist are small or not geographically diverse. We present a novel open-access dataset for individual tree crown delineation (TCD) in high-resolution aerial imagery sourced from OpenAerialMap (OAM). Our dataset, OAM-TCD, comprises 5072 2048x2048 px images at 10 cm/px resolution with associated human-labeled instance masks for over 280k individual and 56k groups of trees. By sampling imagery from around the world, we are able to better capture the diversity and morphology of trees in different terrestrial biomes and in both urban and natural environments. Using our dataset, we train reference instance and semantic segmentation models that compare favorably to existing state-of-the-art models. We assess performance through k-fold cross-validation and comparison with existing datasets; additionally we demonstrate compelling results on independent aerial imagery captured over Switzerland and compare to municipal tree inventories and LIDAR-derived canopy maps in the city of Zurich. Our dataset, models and training/benchmark code are publicly released under permissive open-source licenses: Creative Commons (majority CC BY 4.0), and Apache 2.0 respectively.

  • 8 authors
·
Jul 16, 2024

Why do Random Forests Work? Understanding Tree Ensembles as Self-Regularizing Adaptive Smoothers

Despite their remarkable effectiveness and broad application, the drivers of success underlying ensembles of trees are still not fully understood. In this paper, we highlight how interpreting tree ensembles as adaptive and self-regularizing smoothers can provide new intuition and deeper insight to this topic. We use this perspective to show that, when studied as smoothers, randomized tree ensembles not only make predictions that are quantifiably more smooth than the predictions of the individual trees they consist of, but also further regulate their smoothness at test-time based on the dissimilarity between testing and training inputs. First, we use this insight to revisit, refine and reconcile two recent explanations of forest success by providing a new way of quantifying the conjectured behaviors of tree ensembles objectively by measuring the effective degree of smoothing they imply. Then, we move beyond existing explanations for the mechanisms by which tree ensembles improve upon individual trees and challenge the popular wisdom that the superior performance of forests should be understood as a consequence of variance reduction alone. We argue that the current high-level dichotomy into bias- and variance-reduction prevalent in statistics is insufficient to understand tree ensembles -- because the prevailing definition of bias does not capture differences in the expressivity of the hypothesis classes formed by trees and forests. Instead, we show that forests can improve upon trees by three distinct mechanisms that are usually implicitly entangled. In particular, we demonstrate that the smoothing effect of ensembling can reduce variance in predictions due to noise in outcome generation, reduce variability in the quality of the learned function given fixed input data and reduce potential bias in learnable functions by enriching the available hypothesis space.

  • 3 authors
·
Feb 2, 2024

SeqPE: Transformer with Sequential Position Encoding

Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, fixed-size lookup tables used in traditional learnable position embeddings (PEs) limit extrapolation capabilities beyond pre-trained sequence lengths. Expert-designed methods such as ALiBi and RoPE, mitigate this limitation but demand extensive modifications for adapting to new modalities, underscoring fundamental challenges in adaptability and scalability. In this work, we present SeqPE, a unified and fully learnable position encoding framework that represents each n-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings in an end-to-end manner. To regularize SeqPE's embedding space, we introduce two complementary objectives: a contrastive objective that aligns embedding distances with a predefined position-distance function, and a knowledge distillation loss that anchors out-of-distribution position embeddings to in-distribution teacher representations, further enhancing extrapolation performance. Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy--particularly under context length extrapolation--but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign. We release our code, data, and checkpoints at https://github.com/ghrua/seqpe.

  • 8 authors
·
Jun 16 2

Fine-tuning of Geospatial Foundation Models for Aboveground Biomass Estimation

Global vegetation structure mapping is critical for understanding the global carbon cycle and maximizing the efficacy of nature-based carbon sequestration initiatives. Moreover, vegetation structure mapping can help reduce the impacts of climate change by, for example, guiding actions to improve water security, increase biodiversity and reduce flood risk. Global satellite measurements provide an important set of observations for monitoring and managing deforestation and degradation of existing forests, natural forest regeneration, reforestation, biodiversity restoration, and the implementation of sustainable agricultural practices. In this paper, we explore the effectiveness of fine-tuning of a geospatial foundation model to estimate above-ground biomass (AGB) using space-borne data collected across different eco-regions in Brazil. The fine-tuned model architecture consisted of a Swin-B transformer as the encoder (i.e., backbone) and a single convolutional layer for the decoder head. All results were compared to a U-Net which was trained as the baseline model Experimental results of this sparse-label prediction task demonstrate that the fine-tuned geospatial foundation model with a frozen encoder has comparable performance to a U-Net trained from scratch. This is despite the fine-tuned model having 13 times less parameters requiring optimization, which saves both time and compute resources. Further, we explore the transfer-learning capabilities of the geospatial foundation models by fine-tuning on satellite imagery with sparse labels from different eco-regions in Brazil.

  • 16 authors
·
Jun 28, 2024

PureForest: A Large-scale Aerial Lidar and Aerial Imagery Dataset for Tree Species Classification in Monospecific Forests

Knowledge of tree species distribution is fundamental to managing forests. New deep learning approaches promise significant accuracy gains for forest mapping, and are becoming a critical tool for mapping multiple tree species at scale. To advance the field, deep learning researchers need large benchmark datasets with high-quality annotations. To this end, we present the PureForest dataset: a large-scale, open, multimodal dataset designed for tree species classification from both Aerial Lidar Scanning (ALS) point clouds and Very High Resolution (VHR) aerial images. Most current public Lidar datasets for tree species classification have low diversity as they only span a small area of a few dozen annotated hectares at most. In contrast, PureForest has 18 tree species grouped into 13 semantic classes, and spans 339 km^2 across 449 distinct monospecific forests, and is to date the largest and most comprehensive Lidar dataset for the identification of tree species. By making PureForest publicly available, we hope to provide a challenging benchmark dataset to support the development of deep learning approaches for tree species identification from Lidar and/or aerial imagery. In this data paper, we describe the annotation workflow, the dataset, the recommended evaluation methodology, and establish a baseline performance from both 3D and 2D modalities.

  • 2 authors
·
Apr 18, 2024

ForestSplats: Deformable transient field for Gaussian Splatting in the Wild

Recently, 3D Gaussian Splatting (3D-GS) has emerged, showing real-time rendering speeds and high-quality results in static scenes. Although 3D-GS shows effectiveness in static scenes, their performance significantly degrades in real-world environments due to transient objects, lighting variations, and diverse levels of occlusion. To tackle this, existing methods estimate occluders or transient elements by leveraging pre-trained models or integrating additional transient field pipelines. However, these methods still suffer from two defects: 1) Using semantic features from the Vision Foundation model (VFM) causes additional computational costs. 2) The transient field requires significant memory to handle transient elements with per-view Gaussians and struggles to define clear boundaries for occluders, solely relying on photometric errors. To address these problems, we propose ForestSplats, a novel approach that leverages the deformable transient field and a superpixel-aware mask to efficiently represent transient elements in the 2D scene across unconstrained image collections and effectively decompose static scenes from transient distractors without VFM. We designed the transient field to be deformable, capturing per-view transient elements. Furthermore, we introduce a superpixel-aware mask that clearly defines the boundaries of occluders by considering photometric errors and superpixels. Additionally, we propose uncertainty-aware densification to avoid generating Gaussians within the boundaries of occluders during densification. Through extensive experiments across several benchmark datasets, we demonstrate that ForestSplats outperforms existing methods without VFM and shows significant memory efficiency in representing transient elements.

  • 5 authors
·
Mar 8

ReCo: Region-Controlled Text-to-Image Generation

Recently, large-scale text-to-image (T2I) models have shown impressive performance in generating high-fidelity images, but with limited controllability, e.g., precisely specifying the content in a specific region with a free-form text description. In this paper, we propose an effective technique for such regional control in T2I generation. We augment T2I models' inputs with an extra set of position tokens, which represent the quantized spatial coordinates. Each region is specified by four position tokens to represent the top-left and bottom-right corners, followed by an open-ended natural language regional description. Then, we fine-tune a pre-trained T2I model with such new input interface. Our model, dubbed as ReCo (Region-Controlled T2I), enables the region control for arbitrary objects described by open-ended regional texts rather than by object labels from a constrained category set. Empirically, ReCo achieves better image quality than the T2I model strengthened by positional words (FID: 8.82->7.36, SceneFID: 15.54->6.51 on COCO), together with objects being more accurately placed, amounting to a 20.40% region classification accuracy improvement on COCO. Furthermore, we demonstrate that ReCo can better control the object count, spatial relationship, and region attributes such as color/size, with the free-form regional description. Human evaluation on PaintSkill shows that ReCo is +19.28% and +17.21% more accurate in generating images with correct object count and spatial relationship than the T2I model.

  • 11 authors
·
Nov 23, 2022

MouSi: Poly-Visual-Expert Vision-Language Models

Current large vision-language models (VLMs) often encounter challenges such as insufficient capabilities of a single visual component and excessively long visual tokens. These issues can limit the model's effectiveness in accurately interpreting complex visual information and over-lengthy contextual information. Addressing these challenges is crucial for enhancing the performance and applicability of VLMs. This paper proposes the use of ensemble experts technique to synergizes the capabilities of individual visual encoders, including those skilled in image-text matching, OCR, image segmentation, etc. This technique introduces a fusion network to unify the processing of outputs from different visual experts, while bridging the gap between image encoders and pre-trained LLMs. In addition, we explore different positional encoding schemes to alleviate the waste of positional encoding caused by lengthy image feature sequences, effectively addressing the issue of position overflow and length limitations. For instance, in our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1. Experimental results demonstrate that VLMs with multiple experts exhibit consistently superior performance over isolated visual encoders and mark a significant performance boost as more experts are integrated. We have open-sourced the training code used in this report. All of these resources can be found on our project website.

  • 24 authors
·
Jan 30, 2024 1

Extending LLMs' Context Window with 100 Samples

Large Language Models (LLMs) are known to have limited extrapolation ability beyond their pre-trained context window, constraining their application in downstream tasks with lengthy inputs. Recent studies have sought to extend LLMs' context window by modifying rotary position embedding (RoPE), a popular position encoding method adopted by well-known LLMs such as LLaMA, PaLM, and GPT-NeoX. However, prior works like Position Interpolation (PI) and YaRN are resource-intensive and lack comparative experiments to assess their applicability. In this work, we identify the inherent need for LLMs' attention entropy (i.e. the information entropy of attention scores) to maintain stability and introduce a novel extension to RoPE which combines adjusting RoPE's base frequency and scaling the attention logits to help LLMs efficiently adapt to a larger context window. We validate the superiority of our method in both fine-tuning performance and robustness across different context window sizes on various context-demanding tasks. Notably, our method extends the context window of LLaMA-2-7B-Chat to 16,384 with only 100 samples and 6 training steps, showcasing extraordinary efficiency. Finally, we also explore how data compositions and training curricula affect context window extension for specific downstream tasks, suggesting fine-tuning LLMs with lengthy conversations as a good starting point. We release our code and SFT data at https://github.com/GAIR-NLP/Entropy-ABF.

  • 3 authors
·
Jan 13, 2024 1

Climate-sensitive Urban Planning through Optimization of Tree Placements

Climate change is increasing the intensity and frequency of many extreme weather events, including heatwaves, which results in increased thermal discomfort and mortality rates. While global mitigation action is undoubtedly necessary, so is climate adaptation, e.g., through climate-sensitive urban planning. Among the most promising strategies is harnessing the benefits of urban trees in shading and cooling pedestrian-level environments. Our work investigates the challenge of optimal placement of such trees. Physical simulations can estimate the radiative and thermal impact of trees on human thermal comfort but induce high computational costs. This rules out optimization of tree placements over large areas and considering effects over longer time scales. Hence, we employ neural networks to simulate the point-wise mean radiant temperatures--a driving factor of outdoor human thermal comfort--across various time scales, spanning from daily variations to extended time scales of heatwave events and even decades. To optimize tree placements, we harness the innate local effect of trees within the iterated local search framework with tailored adaptations. We show the efficacy of our approach across a wide spectrum of study areas and time scales. We believe that our approach is a step towards empowering decision-makers, urban designers and planners to proactively and effectively assess the potential of urban trees to mitigate heat stress.

  • 5 authors
·
Oct 9, 2023

Visual Position Prompt for MLLM based Visual Grounding

Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability. To address this issue, we introduce VPP-LLaVA, an MLLM equipped with Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms. The global VPP overlays learnable, axis-like embeddings onto the input image to provide structured spatial cues. The local VPP focuses on fine-grained localization by incorporating position-aware queries, which suggests probable object locations. We also introduce a VPP-SFT dataset with 0.6M samples, consolidating high-quality visual grounding data into a compact format for efficient model training. Training on this dataset with VPP enhances the model's performance, achieving state-of-the-art results on standard grounding benchmarks despite using fewer training samples compared to other MLLMs like MiniGPT-v2, which rely on much larger datasets (sim21M samples). The code and VPP-SFT dataset will be available at https://github.com/WayneTomas/VPP-LLaVA upon acceptance.

  • 4 authors
·
Mar 19

DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions

As it is empirically observed that Vision Transformers (ViTs) are quite insensitive to the order of input tokens, the need for an appropriate self-supervised pretext task that enhances the location awareness of ViTs is becoming evident. To address this, we present DropPos, a novel pretext task designed to reconstruct Dropped Positions. The formulation of DropPos is simple: we first drop a large random subset of positional embeddings and then the model classifies the actual position for each non-overlapping patch among all possible positions solely based on their visual appearance. To avoid trivial solutions, we increase the difficulty of this task by keeping only a subset of patches visible. Additionally, considering there may be different patches with similar visual appearances, we propose position smoothing and attentive reconstruction strategies to relax this classification problem, since it is not necessary to reconstruct their exact positions in these cases. Empirical evaluations of DropPos show strong capabilities. DropPos outperforms supervised pre-training and achieves competitive results compared with state-of-the-art self-supervised alternatives on a wide range of downstream benchmarks. This suggests that explicitly encouraging spatial reasoning abilities, as DropPos does, indeed contributes to the improved location awareness of ViTs. The code is publicly available at https://github.com/Haochen-Wang409/DropPos.

  • 6 authors
·
Sep 7, 2023

Impact of a Batter in ODI Cricket Implementing Regression Models from Match Commentary

Cricket, "a Gentleman's Game", is a prominent sport rising worldwide. Due to the rising competitiveness of the sport, players and team management have become more professional with their approach. Prior studies predicted individual performance or chose the best team but did not highlight the batter's potential. On the other hand, our research aims to evaluate a player's impact while considering his control in various circumstances. This paper seeks to understand the conundrum behind this impactful performance by determining how much control a player has over the circumstances and generating the "Effective Runs",a new measure we propose. We first gathered the fundamental cricket data from open-source datasets; however, variables like pitch, weather, and control were not readily available for all matches. As a result, we compiled our corpus data by analyzing the commentary of the match summaries. This gave us an insight into the particular game's weather and pitch conditions. Furthermore, ball-by-ball inspection from the commentary led us to determine the control of the shots played by the batter. We collected data for the entire One Day International career, up to February 2022, of 3 prominent cricket players: Rohit G Sharma, David A Warner, and Kane S Williamson. Lastly, to prepare the dataset, we encoded, scaled, and split the dataset to train and test Machine Learning Algorithms. We used Multiple Linear Regression (MLR), Polynomial Regression, Support Vector Regression (SVR), Decision Tree Regression, and Random Forest Regression on each player's data individually to train them and predict the Impact the player will have on the game. Multiple Linear Regression and Random Forest give the best predictions accuracy of 90.16 percent and 87.12 percent, respectively.

  • 6 authors
·
Feb 22, 2023

Towards A Holistic Landscape of Situated Theory of Mind in Large Language Models

Large Language Models (LLMs) have generated considerable interest and debate regarding their potential emergence of Theory of Mind (ToM). Several recent inquiries reveal a lack of robust ToM in these models and pose a pressing demand to develop new benchmarks, as current ones primarily focus on different aspects of ToM and are prone to shortcuts and data leakage. In this position paper, we seek to answer two road-blocking questions: (1) How can we taxonomize a holistic landscape of machine ToM? (2) What is a more effective evaluation protocol for machine ToM? Following psychological studies, we taxonomize machine ToM into 7 mental state categories and delineate existing benchmarks to identify under-explored aspects of ToM. We argue for a holistic and situated evaluation of ToM to break ToM into individual components and treat LLMs as an agent who is physically situated in environments and socially situated in interactions with humans. Such situated evaluation provides a more comprehensive assessment of mental states and potentially mitigates the risk of shortcuts and data leakage. We further present a pilot study in a grid world setup as a proof of concept. We hope this position paper can facilitate future research to integrate ToM with LLMs and offer an intuitive means for researchers to better position their work in the landscape of ToM. Project page: https://github.com/Mars-tin/awesome-theory-of-mind

  • 4 authors
·
Oct 30, 2023

Position-guided Text Prompt for Vision-Language Pre-training

Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability which is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into Ntimes N blocks, and identifies the objects in each block through the widely used object detector in VLP. It then reformulates the visual grounding task into a fill-in-the-blank problem given a PTP by encouraging the model to predict the objects in the given blocks or regress the blocks of a given object, e.g. filling `P" or ``O" in aPTP ``The block P has a O". This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistently significant improvements across representative cross-modal learning model architectures and several benchmarks, e.g. zero-shot Flickr30K Retrieval (+4.8 in average recall@1) for ViLT vilt baseline, and COCO Captioning (+5.3 in CIDEr) for SOTA BLIP blip baseline. Moreover, PTP achieves comparable results with object-detector based methods, and much faster inference speed since PTP discards its object detector for inference while the later cannot. Our code and pre-trained weight will be released at https://github.com/sail-sg/ptp.

  • 4 authors
·
Dec 19, 2022