We introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics while enabling end-to-end multimodal ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results