Submitted by unofficialmerve t3_zd3n8s in MachineLearning

Hello 👋🏼 I'm Merve, one of the core developers of a library called skops. In the latest release, we introduced a new serialization format for sklearn models that is more secure than pickle.
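Roughly, usage looks like this (a minimal sketch; treat the notebook and docs linked below as the source of truth, since exact function names can shift between releases):

```python
# Minimal sketch of skops-based persistence (see the docs for the up-to-date API)
from sklearn.linear_model import LogisticRegression
from skops.io import dump, load, get_untrusted_types

model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

dump(model, "model.skops")                          # serialize without pickle
unknown = get_untrusted_types(file="model.skops")   # audit: anything not known to be safe
print(unknown)                                      # review this list before loading
restored = load("model.skops", trusted=unknown)     # load only what you have reviewed
```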

You can check this notebook out to see how to use it.

If you want to learn more, check out our docs.

We'd really appreciate it if you could let us know about any issues you run into by opening an issue on GitHub.

obligatory ML meme

142

Comments

link0007 t1_iz11xzx wrote

It's so strange that the Python ML community still hasn't found a suitable model format, despite years and years of effort. What even happened to efforts like PMML?

Meanwhile I'm quite happy with the R infrastructure for storing tidymodels pipelines.

26

unofficialmerve OP t1_iz12b00 wrote

I think it's because this was raised only recently and people really didn't know! Since Yannic Kilcher raised it a couple of months ago, François Chollet has announced a format specific to Keras models, and the PyTorch folks are cooking something for this too. Hopefully this will be solved 🙂

13

ReginaldIII t1_iz1f43w wrote

Tidymodels is a specific example of an R extension package with its own file format. That would be like saying you are quite happy with the Python infrastructure for saving PyTorch models. It's still specific to that thing.

There are plenty of good ways of storing model weights; those based on hdf5 archives are a great choice since they are optimized for block tensor operations and on-disk chunking, support lazy slicing, and support nested groups of tensors. Keras uses hdf5 for its save_weights and load_weights functions.
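For anyone who hasn't used it, the weights-only round trip looks roughly like this (a sketch assuming the TensorFlow/Keras 2 API; layer sizes are arbitrary):

```python
# Sketch: weights-only persistence via HDF5 with Keras (no code is serialized)
import tensorflow as tf

def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(1),
    ])

model = build_model()
model.save_weights("weights.h5")   # HDF5 archive of the weight tensors only

fresh = build_model()              # architecture is reconstructed from your own code
fresh.load_weights("weights.h5")   # then the stored tensors are loaded into it
```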

If your models are getting huge you need a different strategy anyway, and this is where object-store-backed systems like TensorStore (e.g. on S3) become a better fit.

8

unofficialmerve OP t1_iz1iybe wrote

h5 and TF's SavedModel are the safest options, yet AFAIK you can still inject code through Lambda layers or subclassed models (that's why Keras developed a new format too!). What SavedModel does is reconstruct the architecture and load the weights into it, and that architecture part is essentially code (loading the weights is never the problem for any framework anyway, it's the code part!). So again, you shouldn't deserialize it blindly (safest code is no code). If you can see the architecture and confirm that it doesn't have any custom layers, you should be fine; this is also essentially what we do with skops (we audit the model). Alternatively, reconstruct the architecture yourself and load the weights into it, but that's a little tricky, since you might have custom objects or e.g. preprocessing layers for Keras.

>The architecture of subclassed models and layers are defined in the methods __init__ and call. They are considered Python bytecode, which cannot be serialized into a JSON-compatible config -- you could try serializing the bytecode (e.g. via pickle), but it's completely unsafe and means your model cannot be loaded on a different system. (in model subclassing guide)
>
>WARNING: tf.keras.layers.Lambda layers have (de)serialization limitations! (in lambda layers guide)

Hugging Face also introduced a new format called safetensors if you're interested: https://github.com/huggingface/safetensors. In the README there's a detailed explanation & comparison.
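The gist of safetensors is that only raw tensors go on disk, so nothing is executed at load time. A rough sketch (assuming PyTorch tensors; the key names are made up):

```python
# Sketch: tensor-only serialization with safetensors (nothing is executed on load)
import torch
from safetensors.torch import save_file, load_file

weights = {
    "linear.weight": torch.randn(4, 4),  # illustrative names, not a real model
    "linear.bias": torch.zeros(4),
}
save_file(weights, "model.safetensors")

restored = load_file("model.safetensors")  # plain dict of tensors, no code paths
```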

6

lmericle t1_iz1kk77 wrote

What about ONNX? Most if not all feedforward models can be represented as ONNX.

7

link0007 t1_iz1nhuf wrote

Yes! I knew there was another standard but I couldn't for the life of me remember the name.

Perhaps it's also just a matter of the Python crowd doing a bit more complicated stuff than the R crowd. For me the models tend to be quite straightforward RF or related models (like I said: tidymodels), but the demand is much more about getting the pipeline right, with pre- and postprocessing. Things become a bit less easy to store once you go into deep neural networks, I'd imagine.

2

unofficialmerve OP t1_iz1nurr wrote

I'm not sure, but I've been told many times that ONNX support for sklearn was sub-par. I haven't researched that one yet. I can ask the maintainers if you're interested.

1

link0007 t1_iz1p01p wrote

I don't use Python so no need! I just remember being quite confused when I was learning sklearn and realised that saving models or pipelines was weirdly complicated compared to R.

More generally speaking I suppose the RDS data format is pretty great to work with within R.

2

arsenyinfo t1_iz2hxgi wrote

I deployed sklearn models via ONNX in two companies, and it works perfectly.
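For anyone curious, the usual route is skl2onnx for export plus onnxruntime for inference. A rough sketch (not the commenter's actual setup; model and input names are illustrative):

```python
# Sketch: export an sklearn model to ONNX, then run it with onnxruntime
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as rt

X = np.random.rand(100, 4).astype(np.float32)
y = (X[:, 0] > 0.5).astype(int)
model = RandomForestClassifier(n_estimators=10).fit(X, y)

onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, 4]))])
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

sess = rt.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
preds = sess.run(None, {"input": X[:5]})[0]   # inference without unpickling any Python objects
```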

1

Massive_bull_worm t1_iz03ol6 wrote

Why would I care about security when using pickle?

15

RoadsideCookie t1_iz06pow wrote

Because pickle is so easy to attack. It's a format whose deserialization can run pretty much arbitrary Python code, and Python can do pretty much anything, so if you unpickle a compromised payload, anything goes.
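To make the mechanism concrete: any object can tell pickle to call an arbitrary function at load time via `__reduce__`. A harmless sketch (the payload here only echoes a string, but it could just as well wipe your home directory):

```python
# Sketch: why unpickling untrusted data is dangerous (this payload is harmless)
import os
import pickle

class Exploit:
    def __reduce__(self):
        # Tells pickle: "to rebuild me, call os.system(...)"
        return (os.system, ("echo you just ran attacker-controlled code",))

payload = pickle.dumps(Exploit())
pickle.loads(payload)   # prints the message: arbitrary code ran during deserialization
```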

32

MustachedLobster t1_iz09bzl wrote

Because some people make processed data/pretrained models available online as pickle files.

It'd be nice to be able to open them without having to worry about bad actors nuking my home directory.

22

acamara t1_iz0ziae wrote

Pickle objects can be (almost) anything, including arbitrary code. Now, imagine a bad actor claiming to be publishing a SOTA Random Forests model. However, embedded in their .pkl file is a statement like import shutil; shutil.rmtree('./').

Pickle will happily execute this code. There is nothing checking whether the pickle file is safe.

P.S. of course the syntax is not that simple, but I hope you get it (and I’m on mobile, yada yada…)

21

WERE_CAT t1_iz0l6h7 wrote

I have found it to be about dependencies more than security. The dependencies created by pickle are rather strict; I have found it renders a significant part of my work unusable when upgrading the Python version.

4

unofficialmerve OP t1_iz10ekx wrote

It can execute arbitrary code, as others said. Other ML frameworks (TF/Keras, PyTorch) are also researching alternative solutions to this at the moment. You should never deserialize a pickle on your local machine unless it was made by you. Pickle is made for Python in general, not specifically for machine learning; this format serializes sklearn models/pipelines while avoiding pickle.

3

Axel-Blaze t1_iz2755s wrote

Just wanted to drop by and say this is really cool and important work. I follow you on Twitter, so I've been following the updates for a bit longer now lol, and I also applied to the skops internship as soon as I saw it when HF opened internships.

Thanks for the great work :))

4

-bb_ t1_iz16yje wrote

You can also use joblib.dump, since it already comes with sklearn.
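For reference, that route looks roughly like this (the same security caveats as pickle apply, as the reply below explains):

```python
# Sketch: persisting an sklearn model with joblib (pickle-based under the hood)
import joblib
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

joblib.dump(model, "model.joblib")       # serialize (handles numpy arrays efficiently)
restored = joblib.load("model.joblib")   # only load files you created yourself
```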

−1

unofficialmerve OP t1_iz19n6x wrote

It uses pickle's way of serialization under the hood. The difference between pickle and joblib is that joblib performs better with numpy objects, AFAIK. You shouldn't deserialize any joblib file on your local machine either (unless you made it yourself).

See one of the sklearn core developers' neat responses on the difference between them: https://stackoverflow.com/a/12617603

8