TODO.md
Important stuff that needs to be done

  • Tokenizer for LaTeX (LaTeXTrOCR/dataset/tokenizer.py)
  • Parquet to Tensor function (LaTeXTrOCR/dataset/tarquet.py)
  • Scraping arXiv for paragraphs (LaTeXTrOCR/dataset/arxiv.py)
  • Scraping the internet for handwritten text and images
  • Layout detection
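
A possible starting point for the tokenizer item, as a rough sketch only: split LaTeX source into control words, control symbols, and single characters with a regular expression, then map tokens to integer ids. The names `TOKEN_RE`, `tokenize`, `build_vocab`, and `encode`, and the `<pad>`/`<unk>` special tokens, are hypothetical and not taken from `LaTeXTrOCR/dataset/tokenizer.py`; the real tokenizer may well use a different scheme (for example a learned subword vocabulary).

```python
import re

# Assumed token pattern (illustrative, not the project's actual design).
# Alternatives are tried left to right, so control words win over symbols.
TOKEN_RE = re.compile(
    r"\\[A-Za-z]+"  # control word, e.g. \frac or \alpha
    r"|\\."         # control symbol, e.g. \{ or \,
    r"|\S"          # any other single non-whitespace character
)


def tokenize(text):
    """Split a LaTeX string into a list of string tokens; whitespace is dropped."""
    return TOKEN_RE.findall(text)


def build_vocab(samples):
    """Assign an integer id to every token seen in `samples`.

    Ids 0 and 1 are reserved for the hypothetical <pad> and <unk> tokens.
    """
    vocab = {"<pad>": 0, "<unk>": 1}
    for sample in samples:
        for token in tokenize(sample):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Convert a LaTeX string to a list of ids, using <unk> for unseen tokens."""
    unk = vocab["<unk>"]
    return [vocab.get(token, unk) for token in tokenize(text)]


if __name__ == "__main__":
    vocab = build_vocab([r"\frac{a}{b} + x^2"])
    print(tokenize(r"\frac{a}{b}"))
    print(encode(r"\frac{x}{y}", vocab))
```

Dropping whitespace keeps the vocabulary small, but it also discards spacing that can matter in text mode, so that choice would need revisiting for paragraph-level data.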