Important stuff that needs to be done:

- Tokenizer for LaTeX (LaTeXTrOCR/dataset/tokenizer.py)
- Parquet to Tensor function (LaTeXTrOCR/dataset/tarquet.py)
- Scraping arXiv for paragraphs (LaTeXTrOCR/dataset/arxiv.py)
- Scraping internet for handwritten text + images
- Layout detection
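
For the LaTeX tokenizer item, here is a minimal sketch of one possible starting point. It is not the contents of tokenizer.py: the function name `tokenize_latex` and the token rules (control words, control symbols, single characters, whitespace dropped) are assumptions made for illustration only.

```python
import re

# Hypothetical token rules (assumed, not taken from the project):
#   1. control words such as \frac or \alpha
#   2. control symbols such as \{ or \\
#   3. a lone trailing backslash
#   4. any other single non-whitespace character
# Whitespace matches none of the alternatives, so findall skips it.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|\\|[^\s\\]")


def tokenize_latex(source):
    """Split a LaTeX string into a flat list of string tokens."""
    return TOKEN_RE.findall(source)


if __name__ == "__main__":
    print(tokenize_latex(r"\frac{a}{b} + x^2"))
```

A character-level split like this keeps the vocabulary small; whether digits or letters should be merged into longer tokens is a design choice left open here.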