Transliterate #19
Comments
In general, no. You could use something like
In your case, you could do a little better since the transliteration process probably operates character-by-character. Something like
from bistring import BistrBuilder, CharacterTokenizer
from unidecode import unidecode
tokenizer = CharacterTokenizer('und')  # or 'en-US', etc.
builder = BistrBuilder(text)
for token in tokenizer.tokenize(text):
    # Replace each character token with its transliteration so the alignment is kept
    builder.replace(token.end - token.start, unidecode(token.modified))
text = builder.build()
By the way, it's on my backlog to implement support for ICU's Transliterator API, which is more powerful than unidecode and similar things.
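To make the character-by-character idea concrete without depending on bistring or unidecode, here is a minimal standard-library sketch. It is not bistring's implementation: `strip_accents` is a hypothetical stand-in for `unidecode`, and `transliterate_aligned` is an illustrative helper that records, for each input code point, which span of the output it produced. It iterates over code points rather than grapheme clusters, which is an assumption made for brevity.

```python
import unicodedata


def strip_accents(ch: str) -> str:
    """Hypothetical stand-in for unidecode: NFKD-decompose, drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def transliterate_aligned(text: str):
    """Transliterate one code point at a time and record the alignment.

    Returns (modified, alignment), where alignment is a list of
    ((orig_start, orig_end), (mod_start, mod_end)) pairs.
    """
    pieces = []
    alignment = []
    mod_pos = 0
    for i, ch in enumerate(text):
        repl = strip_accents(ch)
        pieces.append(repl)
        alignment.append(((i, i + 1), (mod_pos, mod_pos + len(repl))))
        mod_pos += len(repl)
    return "".join(pieces), alignment


# "Caf\u00e9" ends in a precomposed e-acute; "\ufb01" is the "fi" ligature.
modified, alignment = transliterate_aligned("Caf\u00e9 \ufb01n")
```

Because each replacement is tied to one input character, a one-to-many mapping such as the ligature still points back to a single original position, which is the property the builder loop above preserves.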
So since ovalhub/pyicu#107 was implemented, I've tested out an implementation that wraps a
Thank you for the great info and tips. Agreed that transliteration doesn't always make sense to do, e.g., your example. I realize now why I didn't think to do it the way you mentioned. I had it in my mind that bistr keeps track of each operation's output instead of always overriding modified, i.e., modified is a list so one could roll back to a certain state. I had built this into my own version of this. The use case being that I could see which operation caused the string transformation train to derail.
Ah I see, but that would be polystring, not bistring :). More seriously, I am considering adding a data type that would retain an entire history of transformations, rather than just the initial and final states. The Emacs region-specific undo buffer stuff seems to have that, for example, but I'm not sure what encoding they use. I imagine it's a persistent stack of ropes or something.
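For illustration only, the "retain the whole history" idea could be sketched as below. `HistoryString` is a hypothetical class, not part of bistring, and it stores plain string snapshots without any alignment information, so it shows the rollback use case rather than a real design for such a data type.

```python
from typing import Callable, List, Tuple


class HistoryString:
    """Hypothetical sketch: keep every intermediate state, not just first and last."""

    def __init__(self, original: str) -> None:
        self._states: List[Tuple[str, str]] = [("original", original)]

    @property
    def modified(self) -> str:
        # The current text is always the most recent snapshot.
        return self._states[-1][1]

    def apply(self, name: str, func: Callable[[str], str]) -> "HistoryString":
        # Record the operation name alongside its output.
        self._states.append((name, func(self.modified)))
        return self

    def rollback(self, steps: int = 1) -> "HistoryString":
        # Drop the last `steps` snapshots, but never the original state.
        keep = max(1, len(self._states) - steps)
        del self._states[keep:]
        return self

    def history(self) -> List[Tuple[str, str]]:
        return list(self._states)


hs = HistoryString("  Hello World  ")
hs.apply("strip", str.strip).apply("lower", str.lower)
steps = [name for name, _ in hs.history()]
final = hs.modified
hs.rollback()
after_rollback = hs.modified
```

Inspecting `history()` after a pipeline run shows the output of each named step, which is one way to find the operation that derailed a transformation.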
I was hoping you might advise me on how to incorporate transliteration into a text transformation pipeline.
Let's say I want to use a third-party library like
from unidecode import unidecode
I could create a bistring with
new_bistr = bistr(text.modified, unidecode(text.modified))
but I would lose all the previous operations.
Is there a way to fold in a modified string that is calculated outside bistring's capabilities?