Feature request: Charset by extension #110

Open
triska opened this issue Mar 4, 2018 · 9 comments

@triska
Member

triska commented Mar 4, 2018

Use case: I would like to serve UTF-8 encoded *.txt files.

When I use the following server:

:- use_module(library(http/thread_httpd)).
:- use_module(library(http/http_unix_daemon)).
:- use_module(library(http/http_files)).

:- http_handler(root(.), http_reply_from_files(., []), [prefix]).

then I can fetch *.txt files. However, the content-type in responses is:

Content-Type: text/plain

whereas I would like to get responses such as:

Content-Type: text/plain; charset=UTF-8

Is there an easy way to configure the SWI infrastructure to specify charset=UTF-8 when serving *.txt files? Alternatively, would you please consider adding such a feature? Thank you!

@wouterbeek
Contributor

@triska There is a hook http:mime_type_encoding/2 in library http/http_header that can be used to associate encodings with Media Types. http_reply_file/3 already uses hook mime:mime_extension/2 to associate Media Types with the file extension of served files. Maybe http_reply_file/3 could be taught to not only use the Media Type associations, but also their encodings as per http:mime_type_encoding/2?

@JanWielemaker
Member

Complicated issue. The good news is that more and more tools seem to encourage people to add encoding/charset declarations, so we can reduce the guessing we need. As Wouter knows better than I do, the claim is often wrong, so we still have a long way to go ...

The hook in http_header.pl looks promising, but serves a different purpose. It is there to ensure that the HTTP streams have the correct Prolog encoding, so it is facing inwards rather than outwards and is targeted at documents you generate in Prolog rather than static files. In my experience, using something that is meant for A in the context of B is asking for trouble.

Probably the best way is to extend the mimetype library. I'm not entirely sure how, though. There is no sensible default encoding for text files in general. There may be one for a particular deployment, probably defined by a combination of the current locale, the content of the files and the intended audience. For example, if you have a purely Russian website, using the 'KOI8-R' charset makes perfect sense, as ISO Latin-1 does for a group of languages. If you have a heterogeneous set of files you probably use UTF-8.

Could we derive the default from the Prolog encoding/locale? Probably hard as locale names differ by OS and are not standardized AFAIK. On the other hand, if we have a text file and the Prolog encoding is utf8 we could assume that adding charset=UTF-8 as a default makes sense. This would indicate we need an indirection, first from media type to text/binary and then to charset for the text files.

Does this make sense?

@triska
Member Author

triska commented Mar 4, 2018

In my opinion, at least for *.txt files, almost any concrete (and fixed) default charset that can represent a reasonable set of languages is better than nothing.

If there were, in addition, a way to map file extensions to charsets (in analogy to the already existing extension → content-type mapping), I would already consider it a huge improvement, because it would at least let users specify the encoding they are using for their files, whether it is ISO Latin-1, Shift JIS, UTF-8 or KOI8-R etc.

For comparison, please see the Apache directive AddCharset:

AddCharset EUC-JP .euc
AddCharset ISO-2022-JP .jis
AddCharset SHIFT_JIS .sjis

This directive lets you specify charsets by extension and even for particular files individually:

<Files "example.html">
AddCharset UTF-8 .html
</Files>

One or two extensible predicates for corresponding settings in SWI-Prolog would be a very welcome addition!
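To make the request concrete, here is a minimal sketch of what an extension-to-charset table could look like in plain SWI-Prolog. The names `ext_charset/2` and `content_type_for/3` are hypothetical and are not part of any SWI-Prolog library; the media type is passed in as an atom rather than looked up.

```prolog
% ext_charset(?Extension, ?Charset)
% Hypothetical user-extensible table, loosely analogous to Apache's
% AddCharset directive. Declared multifile so other files can add entries.
:- dynamic   ext_charset/2.
:- multifile ext_charset/2.

ext_charset(euc,  'EUC-JP').
ext_charset(jis,  'ISO-2022-JP').
ext_charset(sjis, 'SHIFT_JIS').
ext_charset(txt,  'UTF-8').

% content_type_for(+File, +MediaType, -ContentType)
% Appends "; charset=..." when the extension of File has a table entry;
% otherwise ContentType is just MediaType.
content_type_for(File, MediaType, ContentType) :-
    file_name_extension(_Base, Ext, File),
    (   ext_charset(Ext, Charset)
    ->  format(atom(ContentType), '~w; charset=~w', [MediaType, Charset])
    ;   ContentType = MediaType
    ).
```

With this table, `content_type_for('notes.txt', 'text/plain', CT)` binds `CT` to `'text/plain; charset=UTF-8'`, while a file with an unlisted extension keeps its bare media type.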

@JanWielemaker
Member

I agree this needs a solution. Just, how? Based on Wouter's comment I considered associating a media type with a charset. The problem then, of course, is that you cannot address individual files or sets of files that share the same media type. This suggests attaching it to the file name/extension, which is what mimetype.pl also does for media types. It seems rather impractical to replicate this for charsets, while in 99% of the cases you just want to say all text files are UTF-8 (or something else).

I'm thinking of something like this:

  • Have a mapping from media type to text/binary (hooked using a multifile predicate)

  • Have a Prolog flag or similar that defines the charset of text files.

  • Have a hookable rule that operates on the filename. The default computes the
    media type, uses the above mapping to determine it is a text file and then the flag to
    determine the charset. You can hook it to do whatever you like. We should probably
    make the full filename accessible from the hook, so you can do things such as checking
    for a BOM marker or look for a config file in the directory of the file.
    I would consider something like

    charset(+FileName, +Mediatype, -Charset) is semidet.
    

Add this stuff to mimetype.pl.

Does that cover what we want?
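The three bullets above can be sketched as a small standalone program. This is only an illustration of the proposed layering, not the code that ended up in mimetype.pl: `my_text_media_type/1`, `my_charset_hook/3`, `my_charset/3` and the flag `my_default_charset` are all hypothetical names, and media types are represented as plain atoms for simplicity.

```prolog
% Both tables are extensible from other files and may be empty.
:- dynamic   my_text_media_type/1, my_charset_hook/3.
:- multifile my_text_media_type/1, my_charset_hook/3.

% 1. Mapping from media type to "this is text".
my_text_media_type('text/plain').
my_text_media_type('text/html').

% 2. A flag that holds the charset assumed for text files.
:- create_prolog_flag(my_default_charset, 'UTF-8', [type(atom)]).

% 3. my_charset(+FileName, +MediaType, -Charset) is semidet.
%    The user hook wins if it succeeds. Otherwise text media types get
%    the charset from the flag. Fails for anything else, meaning
%    "do not emit a charset parameter".
my_charset(File, MediaType, Charset) :-
    my_charset_hook(File, MediaType, Charset),
    !.
my_charset(_File, MediaType, Charset) :-
    my_text_media_type(MediaType),
    !,
    current_prolog_flag(my_default_charset, Charset).
```

Because the hook receives the full file name, a deployment could add a clause such as `my_charset_hook(File, _, 'KOI8-R') :- sub_atom(File, 0, _, _, 'ru/').` to override the default for one directory only.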

@triska
Member Author

triska commented Mar 5, 2018

I think http:charset/3 as you show it in your last bullet point would be a good solution.

The other approaches you mention (in particular a tight coupling between content types and encodings) seem somewhat dubious and too inflexible to me.

The following thread contains some interesting settings that people find useful in practice:

https://stackoverflow.com/questions/913869/how-to-change-the-default-encoding-to-utf-8-for-apache

Please consider in particular the following:

<Files ~ "\.html?$">
     Header set Content-Type "text/html; charset=utf-8"
</Files>

That's clearly much more flexibility than is needed for this concrete issue. However, I still find this very interesting: You can configure Apache to emit particular header fields based on file names. This flexibility could be useful in SWI-Prolog too.

The http:charset/3 hook is of course a special case of such a more general mechanism, and would be great for the particular case at hand.

@JanWielemaker
Member

JanWielemaker commented Mar 8, 2018

I pushed cd44ae4, which I think provides both a fair default and the option to take it all into your own hands. Please have a look and close this issue if it solves your problems.

@triska
Member Author

triska commented Mar 9, 2018

Thank you! The description of file_content_type/[2,3] is unclear to me: In particular, the following sequence seems not to capture what is actually implemented:

1. Determine the media type using file_mime_type/2
2. Determine it is a text file using text_mimetype/1
3. Use the charset from the Prolog flag `default_charset`

I do not know whether the description or code is (in)correct, but it seems what is meant is:

  1. Unless it is already specified, determine MediaType using file_mime_type/2.
  2. If the media type indicates a text file, derive and indicate the associated charset.
  3. For other media types, use the media type but do not indicate a charset.

The description of the hooks and flags seems also not to capture what is actually happening. In particular, I suggest:

  • mime:charset/3 derives the charset for a file with a given media type, if the media type is text according to mime:text_mimetype/1

To me personally, it is also a bit surprising that the charset indication now hinges on text MIME types. Certainly one can think of MIME types such as (hypothetical) model/autocad-xml where you would also like to indicate a charset?

@JanWielemaker
Member

Thanks for the suggestions. Applied. The idea is to use mime:text_mimetype/1 to define those mime types that you have as files on disk, encoded according to your current locale. That should suffice for most setups. In case you have text files using different encodings, still use mime:text_mimetype/1 and then the charset hook to do whatever is needed (check BOM, check extension, look for meta-data in the dir, ...).
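As an illustration of the "check BOM" idea, the following sketch inspects the first bytes of a file. `bom_charset/2` and its helpers are hypothetical names, not library predicates, and only three common byte order marks are recognised.

```prolog
:- use_module(library(lists)).   % append/3

% bom_charset(+File, -Charset) is semidet.
% Hypothetical helper: succeeds only if File starts with a known BOM.
% Raises an exception if File cannot be opened.
bom_charset(File, Charset) :-
    setup_call_cleanup(
        % bom(false) so that the stream does not consume the BOM itself
        open(File, read, In, [type(binary), bom(false)]),
        read_prefix(In, 3, Bytes),
        close(In)),
    bom(Prefix, Charset),
    append(Prefix, _, Bytes),
    !.

% read_prefix(+Stream, +N, -Bytes): read up to N bytes, fewer at EOF.
read_prefix(_, 0, []) :- !.
read_prefix(In, N, Bytes) :-
    get_byte(In, B),
    (   B == -1
    ->  Bytes = []
    ;   Bytes = [B|T],
        N1 is N - 1,
        read_prefix(In, N1, T)
    ).

% bom(?ByteSequence, ?Charset)
bom([0xEF, 0xBB, 0xBF], 'UTF-8').
bom([0xFE, 0xFF],       'UTF-16BE').
bom([0xFF, 0xFE],       'UTF-16LE').
```

Assuming the hook keeps the argument order discussed in this thread, it could then be connected with a clause along the lines of `mime:charset(File, _MediaType, Charset) :- bom_charset(File, Charset).`; files without a BOM make the helper fail, so the default would still apply to them.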

@triska
Member Author

triska commented Mar 11, 2018

Thank you, it works!

At least the charset parameter that is mentioned in the following TBD seems now fully implemented:

https://github.com/SWI-Prolog/packages-http/blob/master/mimetype.pl#L49

If applicable, please consider removing the TBD or citing a different example.
