GenAI Weekly News Update 2024-06-17
News Update
Research Update
AI Model and Product Launch
Anthropic launched Claude 3.5 Sonnet
Highlights: Anthropic launched Claude 3.5 Sonnet, which received a great response from the community and is regarded as one of the best models of its kind. A new feature, Artifacts, also drew attention for its novel interaction model.
This first model in the Claude 3.5 family outperforms Claude 3 Opus on a wide range of evaluations, while matching the speed and cost of Anthropic's mid-tier model, Claude 3 Sonnet. The model is available for free on Claude.ai and the Claude iOS app right now, and costs $3 per million input tokens and $15 per million output tokens via the API. The initial community reaction is quite positive -- early users like the speed and the strong reasoning and math capabilities.
Anthropic also introduced Artifacts on Claude.ai, a new feature that expands how users can interact with Claude. It is a dynamic workspace where they can see, edit, and build upon Claude's creations in real time, seamlessly integrating AI-generated content into their projects and workflows. This preview feature marks Claude's evolution from a conversational AI to a collaborative work environment.
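For readers who want to try the new model right away, here is a minimal sketch using the official anthropic Python SDK, with a cost estimate based on the pricing above. The model ID shown is the launch identifier; the prompt and the cost arithmetic are our own illustration:

```python
# Minimal sketch: calling Claude 3.5 Sonnet via the Anthropic Python SDK
# (pip install anthropic). Assumes ANTHROPIC_API_KEY is set in the environment.
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # launch model ID
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
)
print(message.content[0].text)

# Estimate the cost of this call from the reported token usage, using the
# announced pricing: $3 per 1M input tokens, $15 per 1M output tokens.
usage = message.usage
cost = usage.input_tokens * 3 / 1e6 + usage.output_tokens * 15 / 1e6
print(f"input={usage.input_tokens} output={usage.output_tokens} cost=${cost:.6f}")
```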
Meta unveils five AI models for multi-modal processing, music generation, and more
- Chameleon: Multi-modal text and image processing
- Multi-token prediction for faster language model training (sketched after this list)
- JASCO: Enhanced text-to-music model
- AudioSeal: Detecting AI-generated speech
- Improving text-to-image diversity
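Of these, multi-token prediction is the easiest to illustrate. Below is a toy PyTorch sketch of the general idea: k linear heads on a shared trunk, where head i predicts the token i positions ahead and the losses are averaged. Meta's actual architecture and training recipe differ in detail, and every dimension here is made up:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy multi-token prediction: k output heads on a shared trunk, where head i
# predicts the token i positions ahead. Illustrative only; Meta's actual
# architecture and training recipe differ.

class MultiTokenHeads(nn.Module):
    def __init__(self, d_model=64, vocab=1000, k=4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(k))

    def forward(self, hidden, targets):
        # hidden: (batch, seq, d_model) trunk outputs; targets: (batch, seq) ids
        loss = 0.0
        for i, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-i])  # only positions with a target i ahead
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                targets[:, i:].reshape(-1),
            )
        return loss / len(self.heads)

hidden = torch.randn(2, 16, 64)            # stand-in trunk activations
targets = torch.randint(0, 1000, (2, 16))
print(MultiTokenHeads()(hidden, targets))  # scalar training loss
```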
Research Update
Character.ai published research on optimizing inference efficiency
Highlights: It is pretty amazing that Character.ai now serves over 20,000 inference queries per second (~20% of Google Search's volume). The note contains a lot of useful tips on reducing inference cost for self-hosted LLMs -- including how to reduce the attention KV cache size by 20X, how to cache KV states at a 95% cache rate, and how to quantize efficiently.
Details can be found here.
- Memory-efficient Architecture Design:
- Multi-Query Attention: Reduces KV cache size by 8X
- Hybrid Attention Horizons: Interleaves local and global attention layers
- Cross Layer KV-sharing: Further reduces KV cache size by 2-3x
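These factors multiply, which a quick back-of-envelope sketch makes concrete. All dimensions below are made up, not Character.ai's actual configuration, and the sketch folds in the int8 KV quantization described further down:

```python
# Back-of-envelope KV-cache sizing. All dimensions are illustrative, not
# Character.ai's actual configuration.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem,
                   layer_share_factor=1):
    # 2x for K and V; cross-layer sharing divides the number of distinct KV sets.
    return (2 * (n_layers // layer_share_factor) * n_kv_heads * head_dim
            * seq_len * bytes_per_elem)

baseline = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                          seq_len=4096, bytes_per_elem=2)   # fp16, 8 KV heads
optimized = kv_cache_bytes(n_layers=32, n_kv_heads=1, head_dim=128,
                           seq_len=4096, bytes_per_elem=1,  # int8 KV cache
                           layer_share_factor=2)            # share KV across 2 layers

print(f"baseline:  {baseline / 2**20:.0f} MiB per sequence")   # 512 MiB
print(f"optimized: {optimized / 2**20:.0f} MiB per sequence")  # 16 MiB
print(f"reduction: {baseline / optimized:.0f}x")  # 8 (MQA) * 2 (sharing) * 2 (int8) = 32x
```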
- Stateful Caching:
- Caches attention KV on host memory between chat turns
- Uses a rolling hash system for efficient retrieval
- Achieves a 95% cache rate
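The post describes this as a tree of cached prefixes keyed by a rolling hash over token sequences. The toy below captures just the lookup idea (hash every token prefix, reuse the longest prefix already resident); it is our own simplification and stubs out the actual KV tensors:

```python
# Toy sketch of prefix-based KV caching with a rolling hash, in the spirit of
# the stateful caching described above. Real systems store KV tensors on host
# memory; here we just track which token-prefix hashes are resident.

class PrefixKVCache:
    def __init__(self, base=1_000_003, mod=(1 << 61) - 1):
        self.base, self.mod = base, mod
        self.store = {}  # rolling hash of a token prefix -> cached KV (stubbed)

    def _prefix_hashes(self, tokens):
        h = 0
        for i, t in enumerate(tokens):
            h = (h * self.base + t) % self.mod
            yield i + 1, h  # (prefix length, rolling hash)

    def put(self, tokens, kv_state):
        *_, (n, h) = self._prefix_hashes(tokens)
        self.store[h] = (n, kv_state)

    def longest_cached_prefix(self, tokens):
        best = (0, None)
        for n, h in self._prefix_hashes(tokens):
            if h in self.store:
                best = (n, self.store[h][1])
        return best  # (#tokens whose KV can be reused, cached state)

cache = PrefixKVCache()
turn1 = [5, 17, 99, 3]
cache.put(turn1, kv_state="kv-for-turn1")
turn2 = turn1 + [42, 7]  # the next chat turn extends the previous prefix
reused, state = cache.longest_cached_prefix(turn2)
print(f"reuse KV for first {reused} tokens")  # -> 4
```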
- Quantization for Training and Serving:
- Uses int8 quantization on model weights, activations, and attention KV cache
- Trains models natively in int8 precision
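Training natively in int8 with custom kernels is considerably more involved than post-training quantization; the snippet below only illustrates the basic symmetric quantize/dequantize arithmetic and the resulting memory saving:

```python
import numpy as np

# Minimal symmetric per-tensor int8 quantization sketch. This only illustrates
# the storage-side arithmetic for weights, activations, or the KV cache; it is
# not Character.ai's native int8 training setup.

def quantize_int8(x: np.ndarray):
    scale = float(np.abs(x).max()) / 127.0 or 1.0  # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
print("max abs error:", np.abs(w - w_hat).max())      # small relative to scale
print("bytes: fp32 =", w.nbytes, "int8 =", q.nbytes)  # 4x vs fp32, 2x vs fp16
```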
These innovations have reduced serving costs by a factor of 33 since late 2022. Their current system is estimated to be 13.5X more cost-efficient than using leading commercial APIs.
CVPR 2024 Announces Best Paper Award Winners
Highlights: CVPR 2024 has announced its paper awards out of a record of more than 11,500 submissions: 4 papers were recognized in the best paper track and 6 in the best student paper track. We summarize the two best paper winners and two honorable mentions below:
Best Papers
Generative Image Dynamics
topic: image animation. Highlights: The paper presents a new approach for modeling natural oscillation dynamics from a single still picture, which can be used to create looping videos or to let users interact with objects in real images.
- Problems: Generating realistic motion in static images is challenging due to complex physical dynamics. Existing methods often produce artifacts, lack temporal coherence, or require additional inputs.
- What's New:
- Representation: Creates a generative image-space prior on scene motion learned from real video sequences. Uses spectral volumes as a motion representation, which is well-suited for diffusion models.
- Prediction: The authors also designed a frequency-coordinated diffusion sampling process to predict spectral volumes from single images.
- Image-based rendering: Describes how to use the spectral volume to animate the input image with the predicted motion.
- Conclusion: The proposed approach significantly outperforms prior single-image animation baselines in terms of image and video synthesis quality. It can generate photo-realistic animations from a single picture without degradation over time. It enables several downstream applications such as creating seamlessly looping videos and interactive image dynamics.
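To make the "spectral volume" idea concrete, here is a small numpy sketch: each pixel's motion is represented by complex Fourier coefficients at a handful of low frequencies, and an inverse FFT turns those into a displacement trajectory used to warp the still image. In the paper these coefficients are predicted by a diffusion model; below they are simply fabricated:

```python
import numpy as np

# Hedged sketch of the spectral-volume representation: per-pixel motion as
# complex Fourier coefficients at K low frequencies, converted to a
# time-domain displacement trajectory via an inverse FFT.

H, W, K, T = 4, 4, 8, 32  # tiny image, K frequency bands, T output frames
rng = np.random.default_rng(0)

# Fabricated stand-in for the predicted spectral volume: (H, W, K, 2) complex
# coefficients for x/y displacement, with energy concentrated at low frequencies.
spec = rng.normal(size=(H, W, K, 2)) + 1j * rng.normal(size=(H, W, K, 2))
spec *= (1.0 / (1 + np.arange(K)))[None, None, :, None]  # low-frequency emphasis

# Zero-pad to T frequencies, then inverse FFT over the frequency axis to get
# per-pixel displacement trajectories (discarding the imaginary part for this toy).
full = np.zeros((H, W, T, 2), dtype=complex)
full[:, :, :K] = spec
traj = np.fft.ifft(full, axis=2).real  # (H, W, T, 2) displacements over time

print("frame 0 displacement field shape:", traj[:, :, 0].shape)  # (4, 4, 2)
# A renderer would now warp the input image by traj[..., t, :] for each frame t.
```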
Rich Human Feedback for Text-to-Image Generation
topic: human feedback for T2I generation. Highlights: the paper publishes the first rich human feedback dataset and model for text-to-image generation (covering implausibility, aesthetics, and text-image alignment), which can be used to improve image generation quality through human feedback data.
- Problems: Recent text-to-image (T2I) generation models have made significant progress, but many generated images still suffer from issues such as artifacts, implausibility, misalignment with text descriptions, and low aesthetic quality. Existing evaluation metrics and human feedback methods typically provide only single-score summaries, which lack detailed, actionable insights on image quality and text-image alignment.
- What's New:
- The first Rich Human Feedback dataset (RichHF-18K) on generated images, consisting of fine-grained scores, implausibility (artifact) and misalignment image regions, and misalignment keywords, on 18K Pick-a-Pic images.
- A multimodal Transformer model (RAHF) to predict rich feedback on generated images, which the authors show to be highly correlated with human annotations on a test set.
- Demonstrates the usefulness of the rich human feedback predicted by RAHF for improving image generation:
- by using the predicted heatmaps as masks to inpaint problematic image regions
- by using the predicted scores to help finetune image generation models
- The improvement on the Muse model, which differs from the models that generated the images in the training set, shows the strong generalization capacity of the RAHF model.
- Conclusion: The predicted rich feedback can be effectively used to improve image generation through methods like finetuning and region inpainting.
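As a concrete illustration of the inpainting use, the sketch below thresholds a predicted artifact heatmap into a binary mask and inpaints the flagged pixels. The paper regenerates masked regions with a generative model (Muse); a classical OpenCV inpaint stands in here purely to show the mask plumbing, and both the image and the heatmap are random stand-ins:

```python
import numpy as np
import cv2

# Sketch of using a predicted implausibility heatmap as an inpainting mask.
# The image and heatmap below are random stand-ins, not RAHF outputs.

img = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)  # stand-in image
heatmap = np.random.rand(64, 64).astype(np.float32)       # stand-in artifact heatmap

mask = (heatmap > 0.8).astype(np.uint8) * 255  # keep only strongly flagged regions
fixed = cv2.inpaint(img, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)

print("masked pixels:", int((mask > 0).sum()), "of", mask.size)
print("inpainted image shape:", fixed.shape)
```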
Honorable Mention Papers
EventPS: Real-Time Photometric Stereo Using an Event Camera
topic: photometric stereo imaging. Highlights: the paper shows a new method for estimating the surface of a 3D object under varying lighting conditions with lower data requirements. It is capable of processing data at over 30 fps in real-world scenarios.
The paper proposed a novel approach that uses event cameras for real-time photometric stereo, estimating surface normals from continuous radiance changes under varying lighting conditions. By leveraging the high temporal resolution and low bandwidth characteristics of event cameras, EventPS achieves comparable performance to traditional frame-based methods while significantly reducing data requirements. The method demonstrates real-time performance of over 30 fps in real-world scenarios, opening up possibilities for high-speed 3D reconstruction and other time-sensitive applications.
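For context, classic Lambertian photometric stereo (the frame-based setup EventPS accelerates) recovers a pixel's surface normal by least squares from its intensities under several known lights. A minimal numpy version, assuming a Lambertian surface with no shadowing:

```python
import numpy as np

# Classic Lambertian photometric stereo for one pixel: I = L @ (albedo * n),
# so the scaled normal g = albedo * n is the least-squares solution to L g = I.
# Assumes every light actually illuminates the surface (no shadowing).

rng = np.random.default_rng(1)
L = rng.normal(size=(6, 3))                  # 6 known lighting directions
L[:, 2] = np.abs(L[:, 2]) + 0.5              # keep lights in the front hemisphere
L /= np.linalg.norm(L, axis=1, keepdims=True)

n_true = np.array([0.2, -0.3, 0.93]); n_true /= np.linalg.norm(n_true)
albedo = 0.7
I = L @ (albedo * n_true)                    # observed intensities

g, *_ = np.linalg.lstsq(L, I, rcond=None)    # g = albedo * normal
n_hat = g / np.linalg.norm(g)
print("recovered normal:", np.round(n_hat, 3))
print("recovered albedo:", round(float(np.linalg.norm(g)), 3))
```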
pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
topic: 3D field reconstruction. Highlights: the paper shows a new method for 3D reconstruction from a pair of images, featuring real-time, memory-efficient rendering during training and fast 3D reconstruction at inference time.
The paper proposed a novel method that reconstructs 3D radiance fields using Gaussian primitives from just a pair of input images, enabling real-time and memory-efficient rendering for training and fast 3D reconstruction at inference time. The approach uses a multi-view epipolar transformer to resolve scale ambiguity and introduces a probabilistic prediction scheme for Gaussian parameters to overcome local minima issues. Experiments show that pixelSplat outperforms state-of-the-art light field transformers on real-world datasets while significantly reducing rendering time and memory usage, and producing an interpretable and editable 3D representation.
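The probabilistic prediction scheme is the part most worth sketching: instead of regressing a depth directly (which can get stuck in local minima), the network predicts a distribution over depth buckets along each camera ray and places a Gaussian at a sampled depth. The toy below fabricates that distribution and elides the paper's reparameterization trick and covariance/color prediction; the structure and all numbers are our own simplification:

```python
import numpy as np

# Toy sketch of probabilistic Gaussian placement in the spirit of pixelSplat:
# sample a depth from a predicted per-ray distribution, then center a 3D
# Gaussian on the ray at that depth. All values here are fabricated.

rng = np.random.default_rng(0)
n_buckets = 16
depths = np.linspace(0.5, 10.0, n_buckets)       # candidate depths along the ray

logits = rng.normal(size=n_buckets)              # stand-in network output
p = np.exp(logits - logits.max()); p /= p.sum()  # softmax over depth buckets

d = rng.choice(depths, p=p)                      # sampled depth
ray_o, ray_dir = np.zeros(3), np.array([0.0, 0.0, 1.0])
gaussian = {
    "mean": ray_o + d * ray_dir,   # Gaussian center on the camera ray
    "scale": np.full(3, 0.05),     # predicted by the network in practice
    "opacity": float(p.max()),     # stand-in; the paper ties opacity to the
                                   # predicted probability to keep gradients
}
print(gaussian)
```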