Fedor Sizov
I am a bachelor's student in Computational Linguistics at Saarland University.
My main interests are computational linguistics, NLP development and AI technologies.
Here I collect ideas that I find interesting to implement but have not yet had time to tackle on my own.
If you find any of them interesting, I'll be happy to work on them together.
- Previously, I debugged and enhanced an extension that adds time/presupposition support to nltk-drt, under the supervision of Ivan Rygaev: https://github.com/sfedia/nltk-drt
- It would be nice to submit a PR to the official nltk repository; that would be a good contribution to the library!
- It seems that using Tesseract in legacy mode makes it possible to train OCR very quickly on a specific font of a minor language: on average, 4-5 pages are enough to reach 95% recognition accuracy.
- This technique has been applied successfully to books in Naukan; we have recognized three books so far: https://github.com/sfedia/naukan-ocr
- What if someone created a Docker app that automates the whole process, from making boxes to text recognition?
- The Russian Wiktionary (ru.wiktionary.org) is often vandalized. Many articles still contain vandal content, often buried under a layer of useful edits.
- Some previous work exists on automated vandalism detection for English Wikipedia (possibly also for English Wiktionary, though I am not sure).
- Detecting and correcting vandalized edits in Russian Wiktionary could be an interesting project.