[Haskell-cafe] could we get a Data instance for Data.Text.Text?

Jeremy Shaw

Jan 22, 2010, 5:24:19 PM
to Bryan O'Sullivan, Tom Harper, Duncan Coutts, lo...@seereason.com, haskel...@haskell.org
Hello,

Would it be possible to get a Data instance for Data.Text.Text? This would allow us to create a Serialize instance of Text for use with happstack -- which would be extremely useful.

We (at seereason) are currently using this patch:


which basically adds:

+textType = mkStringType "Data.Text"
+
+instance Data Text where
+   toConstr x = mkStringConstr textType (unpack x)
+   gunfold _k z c = case constrRep c of
+                     (CharConstr x) -> z (pack [x])
+                     _ -> error "gunfold for Data.Text"
+   dataTypeOf _ = textType
+

This particular implementation avoids exposing the internals of the Data.Text type by casting it to a String in toConstr and gunfold. That is similar to how Data is implemented for some numeric types. However, the space cost of casting a Float to a Double is far less than that of casting a Text to a String, so maybe that is not a good idea?

Alternatively, Data.ByteString just does 'deriving Data'. However, bytestring also exports Data.ByteString.Internal, whereas Data.Text.Internal is not exported.

Any thoughts? I would like to get this handled upstream so that all happstack users can benefit from it.

- jeremy

Bryan O'Sullivan

Jan 22, 2010, 9:53:52 PM
to Jeremy Shaw, lo...@seereason.com, Duncan Coutts, haskel...@haskell.org
On Fri, Jan 22, 2010 at 2:24 PM, Jeremy Shaw <jer...@n-heptane.com> wrote:

> Would it be possible to get a Data instance for Data.Text.Text?

From the last time this came up, I gather that the correctish thing to do (for reasons too obscure to me) is to teach SYB and its many cousins about Text, or else there'll be some sort of disturbance in the Force.

If that feels too arduous, I'd consider adding your suggested instance of Data until such time as the One True Generics Package emerges to walk the earth. But please give it a think first.

Neil Mitchell

Jan 23, 2010, 8:57:45 AM
to Bryan O'Sullivan, lo...@seereason.com, Duncan Coutts, haskel...@haskell.org
>> Would it be possible to get a Data instance for Data.Text.Text?
>
> From the last time this came up, I gather that the correctish thing to do
> (for reasons too obscure to me) is to teach SYB and its many cousins about
> Text, or else there'll be some sort of disturbance in the Force.

No, that's definitely not correct, or even remotely scalable as we
increase the number of abstract types in disparate packages. If
someone suggests it's necessary for their generics library, I suggest
you use Uniplate ;-)

There are two options, both listed in the above email.

1) Use string conversion in the instance. This is morally correct, and
works perfectly. However, as mentioned, it's not great performing. The
Map/Set instances both do a similar trick.

2) Just add deriving on the Data type, and hope no one abuses the
internals. This is what ByteString does, it works great, it's fast,
but you are violating some amount of abstraction. You have to trust
people not to break that abstraction, but it's not a simple
abstraction to break - it's the moral equivalent of pointer prodding
in a std::string, no one breaks it accidentally.

> If that feels too arduous, I'd consider adding your suggested instance of
> Data until such time as the One True Generics Package emerges to walk the
> earth. But please give it a think first.

Data.Data is the one true runtime reflection package, so Data
instances are strongly advised, totally ignoring Generics stuff. I
would pick option 2, but a Data instance really is useful.
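
As a rough illustration of option 2 (not code from the text package), the sketch below derives Data for a hypothetical type Buf that stands in for a library's internal representation. The names Buf, constructorName and fieldCount are made up for this example. It shows that a derived instance lets any generic traversal see the constructor and its fields, which is the abstraction leak discussed above.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
import Data.Data (Data, gmapQ, showConstr, toConstr)

-- Hypothetical stand-in for an internal representation
-- (offset, length, payload). Not the real Text or ByteString layout.
data Buf = Buf Int Int String
  deriving (Show, Data)

-- Name of the outermost constructor, recovered via the derived instance.
constructorName :: Data a => a -> String
constructorName = showConstr . toConstr

-- Number of immediate fields that a generic traversal can reach.
fieldCount :: Data a => a -> Int
fieldCount = length . gmapQ (const ())
```

With a derived instance, constructorName and fieldCount work on Buf without any cooperation from the module that defines it, so hiding the constructor in the export list no longer hides the representation from generic code.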

Thanks, Neil

Neil Brown

Jan 23, 2010, 9:55:01 AM
to Jeremy Shaw, lo...@seereason.com, Duncan Coutts, haskel...@haskell.org
Jeremy Shaw wrote:
> Hello,
>
> Would it be possible to get a Data instance for Data.Text.Text? This
> would allow us to create a Serialize instance of Text for use with
> happstack -- which would be extremely useful.
Last time this came up, I had a look at providing a Data instance for
Text, and I "got as far as needing a Data instance for ByteArray#,
accompanied by an error I don't fully understand, but I think is telling
me that things involving magic hashes are magic:

Data/Text/Array.hs:104:35:
Couldn't match kind `#' against `*'
When matching the kinds of `ByteArray# :: #' and `d :: *'
Expected type: d
Inferred type: ByteArray#
In the first argument of `z', namely `Array' "

The problem with a Data instance for Text is that it is using this
ByteArray# type, which can't easily interact with the Data type-class
because it's a special type. I would suggest providing a Data instance
for ByteArray#, but I don't think that's possible either. As far as I
can understand it all, your Data instance is probably the closest you
are going to get to having a decent Data instance without something else
(GHC/SYB) changing significantly.

Thanks,

Neil.

Jeremy Shaw

Jan 23, 2010, 5:57:49 PM
to Neil Mitchell, lo...@seereason.com, Duncan Coutts, haskel...@haskell.org
On Sat, Jan 23, 2010 at 7:57 AM, Neil Mitchell <ndmit...@gmail.com> wrote:

> No, that's definitely not correct, or even remotely scalable as we
> increase the number of abstract types in disparate packages.

Yes.. happstack is facing another aspect of this scalability issue as well. We have a class, Serialize, which is used to serialize and deserialize data. It builds on the binary library, but adds the ability to version your data types and migrate data from older versions to newer versions.

This has a serious scalability issue though, because it requires that each type a user might want to serialize has a Serialize instance.

So do we:

  1. provide Serialize instances for as many data types from libraries on hackage as we can, resulting in depending on a large number of packages that people are required to install, even though they will only use a small fraction of them.

  2. convince people that Serialize deserves the same status as Data, and then convince authors to create Serialize instances for their types? It would be nice, but authors will start complaining if they are asked to provide a zillion other instances for their types as well. And they will be annoyed if their library has to depend on a bunch of other libraries, just so they can provide some instances that only a small fraction of their users might use. So, this method does not scale as the number of 'interesting' classes grows.

  3. let individual users define the Serialize instances as they need them. Unfortunately, if two different library authors defined a Serialize instance for Text in their libraries, you could not use both libraries in your application because of the conflicting Serialize instances. So this method does not scale when the number of libraries using the Serialize class grows.

Not really sure what the workaround is. #1 could work if there was some way to just selectively install the pieces as you need them. But the only way to do this now would be to create a lot of cabal packages which each define a single instance -- happstack-text, happstack-map, happstack-time, happstack-etc. One for each package that has types we want to create a serialization instance for...

Any other suggestions?

- jeremy


Nicolas Pouillard

Jan 23, 2010, 6:19:25 PM
to Jeremy Shaw, Neil Mitchell, lo...@seereason.com, Duncan Coutts, haskel...@haskell.org
On Sat, 23 Jan 2010 16:57:49 -0600, Jeremy Shaw <jer...@n-heptane.com> wrote:
> On Sat, Jan 23, 2010 at 7:57 AM, Neil Mitchell <ndmit...@gmail.com>wrote:
>
>
> > No, that's definitely not correct, or even remotely scalable as we
> > increase the number of abstract types in disparate packages.
>
>
> Yes.. happstack is facing another aspect of this scalability issue as well.
> We have a class, Serialize, which is used to serialize and deserialize data.
> It builds on the binary library, but adds the ability to version your data
> types and migrate data from older versions to newer versions.
>
> This has a serious scalability issue though, because it requires that each
> type a user might want to serialize has a Serialize instance.
>
> So do we:

[..]

> Any other suggestions?

4. Write a new package:
* serialize-text
* text-instances (which would be a place holder for more instances)

I would go for trying solution 2. and otherwise solution 4.

--
Nicolas Pouillard
http://nicolaspouillard.fr

Derek Elkins

Jan 23, 2010, 7:02:43 PM
to Jeremy Shaw, Duncan Coutts, haskel...@haskell.org, lo...@seereason.com

The only safe rule is: if you don't control the class, C, or you don't
control the type constructor, T, don't make instance C T. Application
writers can often relax that rule as the set of dependencies for the
whole application is known and in many cases any reasonable instance
for a class C and constructor T is acceptable. Under those
conditions, the worst-case scenario is that the application writer may
need to remove an instance declaration when migrating to new versions
of the dependencies. When you control a class C, you should make as
many (relevant) type constructors instances of it as is reasonably
possible, i.e. without adding any extensive dependencies. So at the
very least, all standard type constructors. Similarly for those who
control a type constructor T. This is for convenience. These
correspond to solutions #1 and #2 only significantly weakened.
Definitely, making a package depend on tons of other packages just to
add instances is NOT the correct solution.

The library writers depending on a package for a class and another
package for a type are the problem case. There are three potential
solutions in this case, each of which basically reduces the problem to
one of the above three cases. Either introduce a new type and add it to a
class, introduce a new class and add the types to it, or try to push
the resolution of such things onto the application writer. The first
two options have the benefit that they also protect you from the
upstream libraries introducing instances that won't work for you.
These two options have the drawback that they are usually less
convenient to use. The last option has the benefit that it usually
corresponds to having a more flexible/generic library, in some cases
you can even go so far as to remove your dependence on the libraries
altogether.

One solution to this problem, though it usually can't be done post hoc,
is to simply not use the class mechanism except as a convenience.
This has the benefit that it usually leads to more flexibility and it
helps to realize the third option above. Using Monoid as an example,
one can provide functions of the form: f :: m -> (m -> m -> m) -> ...
and then also provide f' = f mempty mappend :: Monoid m => ... The
parameters can be collected into a record as well. You could even
systematize this into: class C a where getCDict :: CDict a, and then
write f :: CDict a -> ... and f' = f getCDict :: C a => ...
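
The dictionary-passing idea in the paragraph above can be sketched as follows. This is only an illustration under assumed names: MonoidDict, concatWith, concatWith', monoidDict and sumDict are hypothetical and do not come from any library mentioned in this thread.

```haskell
-- Hypothetical record playing the role of the Monoid class dictionary.
data MonoidDict m = MonoidDict
  { dEmpty  :: m
  , dAppend :: m -> m -> m
  }

-- The "real" function takes the operations explicitly, so a caller can
-- supply them for any type, with or without a class instance.
concatWith :: MonoidDict m -> [m] -> m
concatWith d = foldr (dAppend d) (dEmpty d)

-- Dictionary built from an existing Monoid instance.
monoidDict :: Monoid m => MonoidDict m
monoidDict = MonoidDict mempty mappend

-- The class-based version is only a convenience wrapper.
concatWith' :: Monoid m => [m] -> m
concatWith' = concatWith monoidDict

-- A dictionary for a type that has no (single) Monoid instance.
sumDict :: MonoidDict Int
sumDict = MonoidDict 0 (+)
```

Here concatWith sumDict [1, 2, 3] adds the numbers without any instance for Int, while concatWith' ["ab", "cd"] uses the list instance. A library written this way never needs an orphan instance, because the application writer can always pass the operations directly.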

Whatever one does, do NOT add instances of type constructors you don't
control to classes you don't control. This can lead to cases where
two libraries can't be used together at all.

Lennart Augustsson

Jan 23, 2010, 7:52:30 PM
to Derek Elkins, Duncan Coutts, haskel...@haskell.org, lo...@seereason.com
> The only safe rule is: if you don't control the class, C, or you don't
> control the type constructor, T, don't make instance C T.

I agree in principle, but in the real world you can't live by this rule.
Example: I want to use Uniplate to traverse the tree built by haskell-src-exts.
Using Data.Data is too slow, so I need to make my own instances.
HSE provides like 50 types that need instances, and it has to be
exactly those types.
Also, Uniplate requires instances of a particular class it has.

I don't own either of these packages. Including the HSE instances in
Uniplate would just be plain idiotic.
Including the Uniplate instances with HSE would make some sense, but
would make HSE artificially depend on Uniplate for those who don't
want the instances.

So, what's left is to make orphan instances (that I own). It's not
ideal, but I don't see any alternative to it.

-- Lennart

Neil Mitchell

Jan 24, 2010, 6:49:53 AM
to Lennart Augustsson, lo...@seereason.com, Duncan Coutts, haskel...@haskell.org
Hi,

The problem with Data for Text isn't that we have to write a new
instance, but that you could argue that proper handling of Text with
Data would not be using a type class, but have special knowledge baked
in to Data. That's far worse than the Serialise problem mentioned
above, and no one other than the Data authors could solve it. Of
course, I don't believe that, but it is a possible interpretation.

The Serialise problem is a serious one. I can't think of any good
solutions, but I recommend you give knowledge of your serialise class
to Derive (http://community.haskell.org/~ndm/derive/) and then at
least the instances can be auto-generated. Writing lots of boilerplate
and regularly ripping it up is annoying, setting up something to
generate it for you reduces the pain.

>> The only safe rule is: if you don't control the class, C, or you don't
>> control the type constructor, T, don't make instance C T.
>
> I agree in principle, but in the real world you can't live by this rule.
> Example, I want to use Uniplate to traverse the tree built by haskell-src-exts,
> Using Data.Data is too slow, so I need to make my own instances.
> HSE provides like 50 types that need instances, and it has to be
> exactly those types.
> Also, Uniplate requires instances of a particular class it has.

Read my recent blog post
(http://neilmitchell.blogspot.com/2010/01/optimising-hlint.html), I
optimised Uniplate for working with HSE on top of the Data instances -
it's now significantly faster in some cases, which may mean you don't
need to resort to the Direct stuff. Of course, if you do, then
generating them with Derive is the way to go.

Thanks, Neil

Jeremy Shaw

Jan 25, 2010, 9:16:53 PM
to Neil Mitchell, lo...@seereason.com, Duncan Coutts, haskel...@haskell.org
On Sun, Jan 24, 2010 at 5:49 AM, Neil Mitchell <ndmit...@gmail.com> wrote:
> Hi,
>
> The problem with Data for Text isn't that we have to write a new
> instance, but that you could argue that proper handling of Text with
> Data would not be using a type class, but have special knowledge baked
> in to Data. That's far worse than the Serialise problem mentioned
> above, and no one other than the Data authors could solve it. Of
> course, I don't believe that, but it is a possible interpretation.

Right.. that is the problem with Text. Do you think the correct thing to do for gunfold and toConstr is to convert the Text to a String and then call the gunfold and toConstr for String? Or something else?
 
> The Serialise problem is a serious one. I can't think of any good
> solutions, but I recommend you give knowledge of your serialise class
> to Derive (http://community.haskell.org/~ndm/derive/) and then at
> least the instances can be auto-generated. Writing lots of boilerplate
> and regularly ripping it up is annoying, setting up something to
> generate it for you reduces the pain.

We currently use Template Haskell to generate the Serialize instances in most cases (though some data types have more optimized encodings that were written by hand). However, you must supply the Version and Migration instances by hand (they are superclasses of Serialize).

I am all for splitting the Serialize stuff out of happstack .. it is not really happstack specific. Though I suspect pulling it out is not entirely trivial either. I think the existing code depends on syb-with-class.

- jeremy

José Pedro Magalhães

Jan 26, 2010, 2:39:20 AM
to Jeremy Shaw, Duncan Coutts, haskel...@haskell.org, lo...@seereason.com
Hi Jeremy,

As Neil Mitchell said before, if you really don't want to expose the internals of Text (by just using a derived instance) then you have no other alternative than to use String conversion. If you've been using it already and performance is not a big problem, then I guess it's ok.

Regarding the Serialize issue, maybe I am not understanding the problem correctly: isn't that just another generic function? There are generic implementations of binary get and put for at least two generic programming libraries in Hackage [1, 2], and writing one for SYB shouldn't be hard either, I think. Then you could have a trivial way of generating instances of Serialize, namely something like
instance Serialize MyType where
  getCopy = gget
  putCopy = gput

and you could provide Template Haskell code for generating these. Or even just do
instance (Data a) => Serialize a where ...

if you are willing to use OverlappingInstances and UndecidableInstances...
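
To make the "just another generic function" remark concrete, here is a minimal sketch of a generic encoder written against Data alone. It produces a textual rendering rather than the binary format happstack uses, and the names Point and gencode are hypothetical. Note that it relies on toConstr, so it would not work on a type whose instance leaves toConstr undefined.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
import Data.Data (Data, gmapQ, showConstr, toConstr)

-- Hypothetical example type; any type with a full Data instance would do.
data Point = Point Int Bool
  deriving (Show, Data)

-- Render a value as its constructor name followed by its rendered fields.
-- gmapQ applies gencode to every immediate child, so the recursion covers
-- the whole value without any per-type code.
gencode :: Data a => a -> String
gencode x = "(" ++ unwords (showConstr (toConstr x) : gmapQ gencode x) ++ ")"
```

For example, gencode (Point 1 True) walks the constructor and both fields. A binary get/put pair could follow the same shape, with the decoder built on gunfold.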


Cheers,
Pedro

[1] http://hackage.haskell.org/packages/archive/regular-extras/0.1.2/doc/html/Generics-Regular-Functions-Binary.html
[2] http://hackage.haskell.org/packages/archive/multirec-binary/0.0.1/doc/html/Generics-MultiRec-Binary.html

Jeremy Shaw

Jan 26, 2010, 12:25:48 PM
to José Pedro Magalhães, Duncan Coutts, haskel...@haskell.org, lo...@seereason.com
2010/1/26 José Pedro Magalhães <j...@cs.uu.nl>
> Hi Jeremy,
>
> As Neil Mitchell said before, if you really don't want to expose the internals of Text (by just using a derived instance) then you have no other alternative than to use String conversion. If you've been using it already and performance is not a big problem, then I guess it's ok.
>
> Regarding the Serialize issue, maybe I am not understanding the problem correctly: isn't that just another generic function? There are generic implementations of binary get and put for at least two generic programming libraries in Hackage [1, 2], and writing one for SYB shouldn't be hard either, I think. Then you could have a trivial way of generating instances of Serialize, namely something like
> instance Serialize MyType where
>   getCopy = gget
>   putCopy = gput

But in what package does the instance Serialize Text live? text? happstack-data? A new package, serialize-text? That is the question at hand. Each of those choices has rather annoying complications.

As for using generics, Serialization can not be 100% generic, because we also support migration when the type changes. For example, right now ClockTime is defined:

data ClockTime = TOD Integer Integer

Let's say that it is later changed to:

data ClockTime = TOD Bool Integer Integer

Attempting to read the old data you saved would now fail, because the saved data does not have the 'Bool' value. However, perhaps the old data can be migrated by simply setting the Bool to True or False by default. In happstack we would have:

$(deriveSerialize ''Old.ClockTime)
instance Version Old.ClockTime

$(deriveSerialize ''ClockTime)
instance Version ClockTime where
   mode = extension 1 (Proxy :: Proxy Old.ClockTime)

instance Migrate Old.ClockTime ClockTime where
   migrate (Old.TOD i j) = TOD False i j

The Version class is a superclass of the Serialize class. This is required so that when the deserializer is trying to deserialize ClockTime and runs across an older version of the data type, it knows how to find the older deserialization function that works with that version of the type, and where to find the migrate function to bring it up to the latest version.
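
The lookup described above can be sketched in a self-contained way. This is a toy model, not happstack's implementation: the Stored type, the numeric version tags, and the decodeV0, decodeV1 and decode functions are assumptions made for the example, and only the Migrate class mirrors the snippet above.

```haskell
{-# LANGUAGE MultiParamTypeClasses #-}

-- Old and new shapes of the type (the old one renamed for one module).
data ClockTimeV0 = TODv0 Integer Integer
data ClockTime   = TOD Bool Integer Integer
  deriving (Eq, Show)

-- Same idea as the Migrate class above: convert old values to new ones.
class Migrate old new where
  migrate :: old -> new

instance Migrate ClockTimeV0 ClockTime where
  migrate (TODv0 i j) = TOD False i j

-- Toy stored form: a version tag plus raw fields (hypothetical format).
data Stored = Stored Int [Integer]

decodeV0 :: [Integer] -> Maybe ClockTimeV0
decodeV0 [i, j] = Just (TODv0 i j)
decodeV0 _      = Nothing

decodeV1 :: [Integer] -> Maybe ClockTime
decodeV1 [b, i, j] = Just (TOD (b /= 0) i j)
decodeV1 _         = Nothing

-- Dispatch on the tag: old data goes through the old decoder, then migrate.
decode :: Stored -> Maybe ClockTime
decode (Stored 0 fs) = migrate <$> decodeV0 fs
decode (Stored 1 fs) = decodeV1 fs
decode _             = Nothing
```

The point of the sketch is that the migration step is hand-written per type change, which is why a fully generic Serialize instance cannot cover it.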

- jeremy

Jeremy Shaw

Jan 26, 2010, 12:52:34 PM
to Bryan O'Sullivan, Tom Harper, Duncan Coutts, lo...@seereason.com, haskel...@haskell.org
Hello,

Attached is my new and improved patch to add a Data instance to Data.Text. The patch just adds:

+-- This instance preserves data abstraction at the cost of inefficiency.
+-- We omit reflection services for the sake of data abstraction.
+
+instance Data Text where
+  gfoldl f z txt = z pack `f` (unpack txt)
+  toConstr _     = error "toConstr"
+  gunfold _ _    = error "gunfold"
+  dataTypeOf _   = mkNoRepType "Data.Text.Text"


Which is based on what the Data instances for Set and Map do:


Yay for cargo culting!

It seems like this is better than nothing, possibly the correct answer, and if someone does decide to add better instances for toConstr and gunfold in the future, nothing should break? For happstack-data, I think we only need dataTypeOf. 

The instance I posted before definitely did not have valid toConstr / gunfold instances, so I think we would have noticed if we were actually trying to use them..
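
For readers who want to try this style of instance without patching text, here is a sketch on a hypothetical abstract type Packed that stands in for Text. The names Packed, pack', unpack', typeNameOf and upperStrings are made up for the example. It uses only base and follows the same shape as the patch: gfoldl goes through the String view, while toConstr and gunfold are left undefined.

```haskell
import Data.Char (toUpper)
import Data.Data (Data (..), cast, dataTypeName, mkNoRepType)
import Data.Maybe (fromMaybe)

-- Hypothetical abstract type; imagine the constructor is not exported.
newtype Packed = Packed String
  deriving (Eq, Show)

pack' :: String -> Packed
pack' = Packed

unpack' :: Packed -> String
unpack' (Packed s) = s

-- Abstraction-preserving instance: generic code sees a value built by
-- pack' from a String, never the real representation.
instance Data Packed where
  gfoldl f z p = z pack' `f` unpack' p
  toConstr _   = error "Packed: toConstr"
  gunfold _ _  = error "Packed: gunfold"
  dataTypeOf _ = mkNoRepType "Packed"

-- Name reported by dataTypeOf.
typeNameOf :: Data a => a -> String
typeNameOf = dataTypeName . dataTypeOf

-- A generic transformation: upper-case a value if it is a String,
-- leave anything else unchanged.
upperStrings :: Data a => a -> a
upperStrings x = case (cast x :: Maybe String) of
  Just s  -> fromMaybe x (cast (map toUpper s))
  Nothing -> x
```

With this instance, gmapT upperStrings (pack' "abc") reaches the String view and rebuilds the value with pack', and typeNameOf still works, whereas anything that calls toConstr or gunfold fails at run time with the error message given.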

- jeremy
text-data.patch

Felipe Lessa

Jan 26, 2010, 12:55:22 PM
to haskel...@haskell.org
On Tue, Jan 26, 2010 at 11:52:34AM -0600, Jeremy Shaw wrote:
> + toConstr _ = error "toConstr"
> + gunfold _ _ = error "gunfold"

Isn't it better to write

error "Data.Text.Text: toConstr"

Usually I try to do this as we don't get stack traces for _|_.

--
Felipe.

Jeremy Shaw

Jan 26, 2010, 1:08:31 PM
to haskel...@haskell.org
On Tue, Jan 26, 2010 at 11:55 AM, Felipe Lessa <felipe...@gmail.com> wrote:
> On Tue, Jan 26, 2010 at 11:52:34AM -0600, Jeremy Shaw wrote:
> > +  toConstr _     = error "toConstr"
> > +  gunfold _ _    = error "gunfold"
>
> Isn't it better to write
>
>  error "Data.Text.Text: toConstr"
>
> Usually I try to do this as we don't get stack traces for _|_.


I think so... none of the other instances do.. but I guess that is not a very good excuse :)

- jeremy 

Neil Mitchell

Jan 26, 2010, 9:21:56 PM
to Jeremy Shaw, logic, Duncan Coutts, haskell-cafe
Hi

>> The problem with Data for Text isn't that we have to write a new
>> instance, but that you could argue that proper handling of Text with
>> Data would not be using a type class, but have special knowledge baked
>> in to Data. That's far worse than the Serialise problem mentioned
>> above, and no one other than the Data authors could solve it. Of
>> course, I don't believe that, but it is a possible interpretation.
>
> Right.. that is the problem with Text. Do you think the correct thing to do for gunfold and toConstr is to convert the Text to a String and then call the gunfold and toConstr for String? Or something else?

No idea sadly - the SYB stuff was never designed to work with abstract
structures, or structures containing strict/unboxed components.
Converting the Text to a String should work, so in the absence of any
better suggestions, that seems reasonable.

>> The Serialise problem is a serious one. I can't think of any good
>> solutions, but I recommend you give knowledge of your serialise class
>> to Derive (http://community.haskell.org/~ndm/derive/) and then at
>> least the instances can be auto-generated. Writing lots of boilerplate
>> and regularly ripping it up is annoying, setting up something to
>> generate it for you reduces the pain.
>
> We currently use template haskell to generate the Serialize instances in most cases (though some data types have more optimized encodings that were written by hand). However, you must supply the Version and Migration instances by hand (they are super classes of Serialize).
> I am all for splitting the Serialize stuff out of happstack .. it is not really happstack specific. Though I suspect pulling it out is not entirely trivial either. I think the existing code depends on syb-with-class.

If you switch to Derive then you can generate the classes with
Template Haskell, or run the Derive tool as a preprocessor. Derive
abstracts over these details, and also tends to be much easier than
working within Template Haskell (which I always find surprisingly
difficult).

Bryan O'Sullivan

Jan 31, 2010, 2:34:05 AM
to Jeremy Shaw, haskel...@haskell.org
On Tue, Jan 26, 2010 at 10:08 AM, Jeremy Shaw <jer...@n-heptane.com> wrote:

> I think so... none of the other instances do.. but I guess that is not a very good excuse :)

Send me a final darcs patch, and I'll apply it.

Jeremy Shaw

Feb 1, 2010, 3:08:04 PM
to Bryan O'Sullivan, haskel...@haskell.org
Attached.

Thanks!
- jeremy
data-instance-for-text.dpatch

Bryan O'Sullivan

Feb 2, 2010, 1:03:44 AM
to Jeremy Shaw, haskel...@haskell.org
On Mon, Feb 1, 2010 at 12:08 PM, Jeremy Shaw <jer...@n-heptane.com> wrote:
> Attached.

Data/Text.hs:175:63:
    Module `Data.Data' does not export `mkNoRepType'

Can you send a followup patch that works against GHC 6.10.4, please?

Jeremy Shaw

Feb 5, 2010, 12:33:13 PM
to Bryan O'Sullivan, haskel...@haskell.org
Hello,

I have attached a new version that should work with GHC 6.10, though I have not tested it. 

The older Data.Data uses mkNorepType instead of mkNoRepType. I just changed the patch to use the older spelling. In GHC >= 6.12 this will issue a warning that the old spelling has been deprecated. This seems like a reasonable fix as long as text drops support for GHC 6.10 before mkNorepType is completely removed from Data.Data (which may never happen?):

Here is the bug:

Also, this patch still won't work with GHC < 6.10, is that ok?

I also noticed in the containers package, there are #ifdefs around the Data instances:

#if __GLASGOW_HASKELL__
...
#endif

Should I add that as well? Or is text only supported under GHC anyway?
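
One possible way to support both spellings without the deprecation warning is to select the function with CPP. This is only a sketch of that idea and not necessarily how the warning was eventually fixed in text; the helper name noRepType is hypothetical, and the 612 threshold assumes mkNoRepType first appeared with GHC 6.12, as described above.

```haskell
{-# LANGUAGE CPP #-}
import Data.Data (DataType, dataTypeName)
#if __GLASGOW_HASKELL__ >= 612
import Data.Data (mkNoRepType)
#else
import Data.Data (mkNorepType)
#endif

-- Hypothetical helper: use whichever spelling the installed base provides.
noRepType :: String -> DataType
#if __GLASGOW_HASKELL__ >= 612
noRepType = mkNoRepType
#else
noRepType = mkNorepType
#endif

-- Name carried by the resulting DataType, for a quick sanity check.
textTypeName :: String
textTypeName = dataTypeName (noRepType "Data.Text.Text")
```

The instance would then use dataTypeOf _ = noRepType "Data.Text.Text" and compile cleanly on either side of the rename.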

- jeremy
data-instance-for-text-2.dpatch

Bryan O'Sullivan

Feb 5, 2010, 3:14:21 PM
to Jeremy Shaw, haskel...@haskell.org
On Fri, Feb 5, 2010 at 9:33 AM, Jeremy Shaw <jer...@n-heptane.com> wrote:
> I have attached a new version that should work with GHC 6.10, though I have not tested it.

Thanks. I fixed the compilation warning, added a Data instance for lazy Text, and released 0.7.1.0.