Very cool effort. That said, and it's probably because of the kind of work that I do, but I have almost never found the four challenges to be any kind of a problem for me. Although I do think there is some kind of contradiction there. Plotting (exploratory data analysis ("EDA"), really) is all about distilling key insights and finding features hidden in data. But you have to have some kind of intuition about where the needle in the haystack is. IME, throwing up a ton of plots and being able to scrub around in them never seems to provide much insight. It's also very fast; usually the feedback loop is like "make a plot, go away and think about it for an hour, decide what plot I need to make next, repeat". If there is too much data on the screen it defeats the point of EDA a little bit.
For me, matplotlib still reigns supreme. Rather than a fancy new visualization framework, I'd love for matplotlib to just be improved (admittedly, fastplotlib covers a different set of needs than what matplotlib does... but the author named it what they named it, so they have invited comparison. ;-) ).
Two things for me at least that would go a long way:
1) Better 3D plotting. It sucks, it's slow, it's basically unusable, although I do like how it looks most of the time. I mainly use PyVista now, but it sure would be nice to have the power of PyVista in a matplotlib subplot with a style consistent with the rest of matplotlib.
2) Some kind of WYSIWYG editor that will let you propagate changes back into your plot easily. It's faster and easier to adjust your plot layout visually rather than in code. I'd love to be able to make a plot, open up a WYSIWYG editor, lay things out a bit, and have those changes propagate back to code so that I can save it for all time.
(If these features already exist I'll be ecstatic ;-) )
I have to agree with your point about EDA. The library is neat, but even the example of covariance matrix animation is a bit contrived.
Every pixel has a covariance with every other pixel, so sliding through the rows of the covariance matrix generates as many faces on the right as there are pixels in a photograph of a face. However, the pixels that strongly co-vary will produce very similar right-side "face" pictures. To get a sense of how many different behaviours there are, one would look for eigenvectors of this covariance matrix. And then 10 or so static eigenvectors of the covariance matrix (eigenfaces [1]) would be much more informative than thousands of animated faces displayed in the example.
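For reference, the eigenfaces computation being described is only a few lines of numpy (a minimal sketch with placeholder data):

```python
import numpy as np

# faces: (n_samples, n_pixels) matrix of flattened 32x32 face images
faces = np.random.rand(400, 32 * 32)  # placeholder data

# Center the data and form the pixel-by-pixel covariance matrix
centered = faces - faces.mean(axis=0)
cov = np.cov(centered, rowvar=False)  # shape (1024, 1024)

# Eigenvectors of the covariance matrix, sorted by descending eigenvalue;
# the top ~10, reshaped back to 32x32, are the "eigenfaces"
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigenfaces = eigvecs[:, order[:10]].T.reshape(10, 32, 32)
```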
Sometimes a big interactive visualisation can be a sign of not having a concrete goal or not knowing how to properly summarise. After all, that's the purpose of a figure: to highlight insights, not to look for ways to display the entire dataset. Pictures that try to display the whole dataset end up shifting the job of exploratory analysis into a visual space and leaving it for somebody else.
Hi, one of the other devs here. As the poster below pointed out what you're missing is that in this case we know that an eigendecomposition or PCA will be useful. However if you're working on matrix decomposition algorithms like us, or if you're trying to design new forms of summary matrices because a covariance matrix isn't informative for your type of data then these types of visualizations are useful. We broadly work on designing new forms of matrix decomposition algorithms so it's very useful to look at the matrices and then try to determine what types of decompositions we want to do.
Ok, different libraries have different use cases; the type of data we work with absolutely necessitates dynamic visualization. You wouldn't view a video with imshow, would you?
Every time I've needed to scrub through something in time like that, dumping a ton of frames to disk using imshow has been good enough. Usually, the limiting factor is how quickly I can generate a single frame.
It's hard for me to imagine what you're doing that necessitates such fancy tools, but I'm definitely interested to learn! My failure of imagination is just that.
The example from the article with the subtitle "Large-scale calcium imaging dataset with corresponding behavior and down-stream analysis" is a good example. We have brain imaging video that is acquired simultaneously with behavioral video data. It is absolutely essential to view the raw video at 30-60Hz.
Aren't you missing the entire point of exploratory data analysis? Eigenfaces are an example of what you can come up with as the end product of your data exploration, after you've tried many ways of looking at the data and determined that eigenfaces are useful.
Your whole third paragraph seems to be criticizing the core purpose of exploratory data analysis as though one should always be able to skip directly to the next phase of having a standardized representation. When entering a new problem domain, somebody needs to actually look at the data in a somewhat raw form. Using the strengths of the human vision system to get a rough idea of what the typical data looks like and the frequency and character of outliers isn't dumping the job of exploratory data analysis onto the reader, it's how the job actually gets done in the first place.
> Using the strengths of the human vision system to get a rough idea of what the typical data looks like and the frequency and character of outliers isn't dumping the job of exploratory data analysis onto the reader, it's how the job actually gets done in the first place.
Yup, this is a good summary of the intent. We also have to remember that the eigenfaces dataset is a very clean/toy data example. Real datasets never look this good, and just going straight to an eigendecomp or PCA isn't informative without first taking a look at things. Often you may want to do something other than an eigendecomp or PCA; get an idea of your data first and then think about what to do to it.
Edit: the point of that example was to show that visually we can judge what the covariance matrix is producing in the "image space". Sometimes a covariance matrix isn't even the right type of statistic to compute from your data and interactively looking at your data in different ways can help.
As a whole, of course you have a point - big visualisations when done properly should help with data exploration. However, from my experience they rarely (but not never) do. I think it's specific to the type of data you work with and the visualisation you employ. Let me give an example.
Imagine we have some big data - like an OMIC dataset about chromatin modification differences between smokers and non-smokers. Genomes are large, so one way to visualise might be a manhattan plot (mentioned here in another comment). Let's (hypothetically) say the pattern in the data is that chromatin in the vicinity of genes related to membrane functioning has more open chromatin marks in smokers compared to non-smokers. A manhattan plot will not tell us that. And in order to be able to detect that in our visualisation we had to already know what we were looking for in the first place.
My point in this example is the following: in order to detect that we would have to know what to visualise first (i.e. visualise the genes related to membrane function separately from the rest). But then when we are looking for these kinds of associations - the visualisation becomes unnecessary. We can capture the comparison of interest with a single number (i.e. average difference between smokers vs non-smokers within this group of genes). And then we can test all kinds of associations by running a script with a for-loop in order to check all possible groups of genes we care about and return a number for each. It's much faster than visualisation. And then after this type of EDA is done, the picture would be produced as a result, displaying the effect and highlighting the insights.
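Something like this loop is what I mean by capturing each comparison as a single number (an illustrative sketch; all names, shapes, and groups here are hypothetical):

```python
import numpy as np

def group_effect(smokers, nonsmokers, gene_idx):
    """One number per gene group: mean smoker-vs-nonsmoker difference
    within that group of genes."""
    return smokers[:, gene_idx].mean() - nonsmokers[:, gene_idx].mean()

# Placeholder data: samples x genes matrices, plus candidate gene groups
smokers = np.random.rand(50, 20000)
nonsmokers = np.random.rand(60, 20000)
gene_groups = {"membrane": np.arange(0, 300), "ribosome": np.arange(300, 500)}

# One summary number per group -- much faster to scan than thousands of plots
effects = {name: group_effect(smokers, nonsmokers, idx)
           for name, idx in gene_groups.items()}
```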
I understand your point about visualisation being an indistinguishable part of EDA. But the example I provided above is much closer to my lived experience.
Yeah, I agree with the general sentiment of what you're saying.
Re: wtallis, I think my original complaint about EDA per se is indeed off the mark.
Certainly creating a 20x20 grid of live-updating GPU plots and visualizations is a form of EDA, but it seems to suggest a complete lack of intuition about the problem you're solving. Like you're just going spelunking in a data set to see what you can find... and that's all you've got; no hypothesis, no nothing. I think if you're able to form even the meagerest of hypotheses, you should be able to eliminate most of these visualizations and focus on something much, much simpler.
I guess this tool purports to eliminate some of this, but there is also a degree of time-wasting involved in setting up all these visualizations. If you do more thinking up front, you can zero in on a smaller and more targeted subset of experiments. Simpler EDA tools may suffice. If you can prove your point with a single line or scatter plot (or number?), that's really the best case scenario.
Eigendecomposition of the covariance matrix, essentially PCA, is probably the first non-trivial step in the analysis of any dataset. The idea in the comment above seems to be that it's more useful to combine some basic knowledge of statistics with simpler visualisation techniques, rather than to quickly generate thousands of shallower plots. Being able to generate thousands of plot is useful, of course, but I would agree that promoting good data-analysis culture is more beneficial.
> Eigendecomposition of the covariance matrix, essentially PCA, is probably the first non-trivial step in the analysis of any dataset
For a sufficiently narrow definition of "dataset", perhaps. I don't think it's the obvious step one when you want to start understanding a time series dataset, for example. (Fourier transform would be a more likely step two, after step one of actually look at some of your data.)
For me, one of the most annoying things in my workflow is when I'm waiting for the software to catch up. If I'm making a plot, there's a lot of little tweaks I want to do to visually extract the maximum amount of information from a dataset. For example, if I'm making a histogram, I may want to adjust the number of bins, change to log scale, set min/max to remove outliers, and change the plot size on page. For the sake of the argument, let's say I'm working with a set of 8 slices of the dataset, so I need to regenerate 8 plots every time I make a tweak. My workflow is:

1. Code the initial plots with default settings.
2. Run numpy to process the data.
3. Run matplotlib to display the data.
4. Look at the results.
5. Make tweaks to the code, then circle back to step 2.

In that cycle, "wait for matplotlib to finish generating the plots" can often be one of the longest parts, and critically it's the vast majority of the cumulative time that I'm waiting rather than actively doing something. Drawing plots should be near instantaneous; there's an entire industry devoted to drawing complicated graphics in 16ms or less, I shouldn't need to wait >100ms for a single 2d grid with some dots and lines on it.
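To make that concrete, each iteration of the cycle boils down to something like this (a matplotlib sketch with placeholder data; the four knobs at the top are the ones that keep changing):

```python
import numpy as np
import matplotlib.pyplot as plt

slices = [np.random.lognormal(size=100_000) for _ in range(8)]  # placeholder

# The knobs that get tweaked on every iteration of the cycle
bins, log_scale, lo, hi = 50, True, 0.0, 20.0

fig, axes = plt.subplots(2, 4, figsize=(16, 6))
for ax, data in zip(axes.flat, slices):
    clipped = data[(data >= lo) & (data <= hi)]  # drop outliers
    ax.hist(clipped, bins=bins, log=log_scale)
plt.show()
```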
Matplotlib is okay, but there's definitely room for improvement, so why not go for that improvement?
I think this varies a lot depending on what you're doing.
I agree 100% that matplotlib is really slow and should be made to run as fast as humanly possible. I would add a (3) to my list above: optimize matplotlib!
OTOH, at least for what I'm doing, the code that runs to generate the data that gets plotted dominates the runtime 99% of the time.
For me, adjusting plots is usually the time waster. Hence point (2) above. I'd love to be able to make the tweaks using a WYSIWYG editor and have my plotting script dynamically updated. The bins, the log scale, the font, the dpi, etc, etc.
I think with your 8 slices examples above: my (2) and (3) would cover your bases. In your view, is the rest of matplotlib really so bad that it needs to be burnt to the ground for progress to be made?
Yeah, I'd love it if mpl could be optimized. I do think that it has a lot of weird design decisions that could justify burning it down and starting from scratch (e.g. weird mix of stateful and stateless api), but I've already learned most of its common quirks so I selfishly don't care anymore, and my only significant complaint is that I want it to be faster :)
edit: regarding runtime, I'm sure this varies a lot based on usecase, but for my usual usecase I store a mostly-processed dataset, so the additional processing before drawing the data is usually minimal.
I'd be curious to hear more about your EDA workflow.
What I want for EDA is a tool that lets me quickly toggle between common views of the dataset. I run through the same analysis over and over again; I don't want to type the same commands repeatedly. I have my own heuristics for which views I want, and I want a platform that lets me write functions that express those heuristics. I want to build the intelligence into the tool instead of having to remember a bunch of commands to type on each dataframe.
For manipulating the plot, I want a low-code UI that lets me point and click the operations I want to use to transform the dataframe. The low-code UI should also emit python code to do the same operations (so you aren't tied to a low-code system, you just use it as a faster way to generate code than typing).
I have built the start of this for my open source datatable UX called Buckaroo. But it's for tables, not for plotting. The approach could be adapted to plotting. Happy to collaborate.
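The "write your heuristics as functions" part might look roughly like this (an illustrative sketch, not Buckaroo's actual API):

```python
import pandas as pd

# Registry of named views; each is a function from DataFrame to DataFrame
VIEWS = {}

def view(name):
    def register(fn):
        VIEWS[name] = fn
        return fn
    return register

@view("nulls")
def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    return df.isna().sum().to_frame("n_null")

@view("numeric")
def numeric_summary(df: pd.DataFrame) -> pd.DataFrame:
    return df.describe().T

def show(df, name):
    # Toggling between views is a dict lookup, not retyped commands
    return VIEWS[name](df)
```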
At least I usually prefer to do the EDA plotting by writing and editing code. This is a lot more flexible. It's relatively rare to need interactivity other than zooming and panning.
The differing approaches can probably be seen in some API choices, although the fastplotlib API is a lot more ergonomic than many others. Having to index the figure or prefixing plots with add_ are minor things, and probably preferable for application development, but for fast-iteration EDA they will start to irritate fast. The "mlab" API of matplotlib violates all sorts of software development principles, but it's very convenient for exploratory use.
Matplotlib's performance (especially with interaction and animation) and its clunky interaction APIs are definite pain points, and a faster library with better interaction support for EDA would be very welcome. Something like an mlab-type wrapper would probably be easy to implement for fastplotlib.
And to bikeshed a bit, I don't love the default black background. It's against usual conventions, difficult for publication and a bit harder to read when used to white.
Writing and editing code is a lot more flexible, but it gets repetitive, and I have written the same stuff so many times. It's all ad hoc; it fixes the problem at the time, then it gets thrown away with the notebook, only to be written again soon.
As an example, I frequently want to run analytics on a dataframe. More complex summary stats. So you write a couple of functions, and have two for loops, iterating over columns and functions. This works for a bit. It's easy to add functions to the list. Then a function throws an error, and you're trying to figure out where you are in two nested for loops.
Or, especially for pandas, you want to separate functions to depend on the same expensive pre-calc. You could pass the existing dict of computed measures so you can reuse that expensive calculation... Now you have to worry about the ordering of functions.
So you could put all of your measures into one big function, but that isn't reusable. So you write your big function over and over.
I built a small dag library that handles this, and lets you specify that your analysis requires keys and provides keys, then the DAG of functions is ordered for you.
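The core idea is a small topological sort over requires/provides sets, roughly like this (a simplified sketch, not the actual library code):

```python
# Order analysis functions so each runs after its requirements are provided.
def order_measures(measures):
    """measures: list of (fn, requires: set, provides: set).
    Returns the functions in a dependency-respecting order."""
    ordered, available = [], set()
    pending = list(measures)
    while pending:
        runnable = [m for m in pending if m[1] <= available]
        if not runnable:
            missing = [m[1] - available for m in pending]
            raise ValueError(f"unsatisfiable requirements: {missing}")
        for m in runnable:
            ordered.append(m[0])
            available |= m[2]   # this function's outputs become available
            pending.remove(m)
    return ordered
```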
I work with R and not python, so some things might not apply, but this:
> [...] it fixes the problem at the time, then it gets thrown away with the notebook only to be written again soon.
Is one of the reasons I stopped using notebooks.
One solution to your problem might be to create a simple executable script that, when called on the file of your dataset in a shell, would produce the visualisation you need. If it's an interactive visualisation then I would create a library or otherwise a re-usable piece of code that can be sourced. It takes some time but ends up saving more time in the end.
If you have custom-made things you have to check on your data tables, then likely no library will solve your problem without you doing some additional work on top.
And for these:
> Or, especially for pandas, you want to separate functions to depend on the same expensive pre-calc. [...] Now you have to worry about the ordering of functions.
I save expensive outputs to intermediate files, and manage dependencies with a very simple build-system called redo [1][2].
For larger datasets, real scripts are a better idea. I expect my stuff to work with datasets up to about 1 GB; caching is easy to layer on and would speed up work for larger datasets, but my code assumes the data fits in memory. It would be easier to add caching than to make sure I don't load an entire dataset into memory. (I don't serialize the entire dataframe to the browser, though.)
Usually I write scripts that use function memoization cache (to disk) for expensive operations. Recently I've also used Marimo sometimes, which has great support for modules (no reloading hacks), can memoize to disk and has deterministic state.
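With joblib, for example, the disk cache is a one-decorator affair (a sketch; the path and column names are made up):

```python
import pandas as pd
from joblib import Memory

memory = Memory("./.cache", verbose=0)

@memory.cache  # re-runs only when the (hashed) inputs change
def expensive_precalc(path):
    df = pd.read_parquet(path)
    return df.groupby("group").agg(["mean", "std"])
```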
I agree with you sfpotter, very interesting. Looks in some ways similar to PyQtGraph regarding real time plotting.
I agree with you regarding matplotlib, although I find a lot of faults/frustration in using it. Both your points on 3D plotting and a WYSIWYG editor would be extremely nice, and as far as I know nothing exists in python ticking these boxes. For 3D I typically default to Matlab as I've found it to be the most responsive/easy to use. I've not found anything directly like a WYSIWYG editor. Stata is the closest but I deplore it; R to some extent has it, but if I'm generating multiple plots it doesn't always work out.
I'm surprised by what you said about "EDA". I find the opposite, a shotgun approach, exploring a vast number of plots with various stratifications gives me better insight. I've explored plotting across multiple languages (R,python,julia,stata) and not found one that meets all my needs.
The biggest issue I often face is that I have 1000 plots I want to generate that are all from separate data groups and could all be plotted in parallel, but most plotting libraries have holds/issues with distribution/parallelization. The closest I've found: I'll often build up a plot in python using a Jupyter notebook. Once I'm done, I'll create a function taking all the needed data and saving a plot out, then either manually or with the help of LLMs convert it to julia, which I've found to be much faster at loading and processing large amounts of data. Then I can loop it using julia's "distributed" package. It's less than ideal (threaded access would be great, rather than having to distribute the data), but I've yet to find something that works. I'd love a simple 2D EDA plotting library that has basic plots like lines, histograms (1d/2d), scatter plots, etc., has basic colorings and alpha values, and is able to handle large amounts (thousands to millions of points) of static data, plotting and saving to disk in parallel. I've debated writing my own library but I have other priorities currently, maybe once I finish my PhD.
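For the pure-python route, one pattern that does work is the headless Agg backend plus one process per figure, since matplotlib is not thread-safe but is fine across processes (a sketch with placeholder data):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; safe to use from worker processes
import matplotlib.pyplot as plt
import numpy as np
from multiprocessing import Pool

def save_plot(args):
    name, x, y = args
    fig, ax = plt.subplots()
    ax.plot(x, y, alpha=0.7)
    fig.savefig(f"{name}.png", dpi=150)
    plt.close(fig)  # figures accumulate in memory otherwise

if __name__ == "__main__":
    jobs = [(f"group_{i}", np.arange(1000), np.random.rand(1000))
            for i in range(1000)]
    with Pool() as pool:  # one figure per worker process
        pool.map(save_plot, jobs)
```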
For point (2), have you tried the perspective-viewer library? You can make edits in the UI and then use the "debug view" to copy and paste the new configuration back into your code.
I work on solving 3D problems: numerical methods for PDEs in R^3, computational geometry, computational mechanics, graphics, etc. Being able to make nice 3D plots is super important for this. I agree it's not always necessary, and when a 2D plot suffices, that's the way to go, but that doesn't obviate my need for 3D plots.
3D plots might be neat if there was some widespread way of displaying them. Unfortunately we can only make 2D projections of 3D plots on our computer screens and pieces of paper.
Shameless plug: I'm actively working on a similar project, Datoviz [1], a C/C++ library with thin Python bindings (ctypes). It supports both 2D and 3D but is currently less mature and feature-complete than fastplotlib. It is also lower level (high-level capabilities will soon be provided by VisPy 2.0 which will be built on top of Datoviz, among other possible backends).
My focus is primarily on raw performance, visual quality, and scalability for large datasets—millions, tens of millions of points, or even more.
I have always admired your datoviz library from afar and check the vispy2/vispy2-sandbox libraries on GitHub every few months to check up on it. When do you think 'soon' is?? Really looking forward to it!
Thanks! The code is currently managed by Nicolas Rougier in a GitHub repository that will be made public next week. This repository hosts the "graphics server protocol" (GSP), an intermediate layer between Datoviz and the future high-level plotting API. For the latter, we’ll need community feedback to shape an API philosophy that aligns with VisPy users' needs—let's aim to publish a write-up this month.
Implementing the API on top of GSP should be relatively straightforward, as the core graphics-related mechanisms are handled by GSP/Datoviz. We've created a Slack channel for discussions—contact me privately if you'd like to join.
I'm certain the host-side heavy lifting is done by numpy, which is a python wrapper around Fortran and C. The visualization heavy lifting is done by pygfx/wgpu-py; wgpu-py binds wgpu-native through its C API. I think wgpu-py compiles to WASM to run in the browser. More and more packages are taking this route.
In fastplotlib, at the end of the day everything is wgpu under the hood, and as the other poster correctly pointed out, numpy wraps Fortran and C.
Seems like a nice library, but I have a hard time seeing myself using it over plotly. The plotly express API is just so simple and easy. For example, here's the docs for the histogram plot: https://plotly.com/python/histograms/
This code (the first example from those docs) gives you a fully interactive, performant histogram plot:
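```python
import plotly.express as px

df = px.data.tips()  # sample dataset bundled with plotly
fig = px.histogram(df, x="total_bill")
fig.show()
```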
Different use cases :) Plotly doesn't give the performance and interactive tools required for many neuroscience visualizations. We also focus more on the primitive graphics and, at least not yet, on the more complex "composite" graphics built with primitives like histograms.
I appreciate the warning and if it's not by claude I apologize, but I do think we should be allowed to express scepticism if things posted are just AI slop (and if we have to fear getting banned or what-have-you as a consequence I genuinely think that's worse for HN long term than the alternative).
Every two weeks or so I peruse github looking for something like this and I have to say this looks really promising. In statistical genetics we make really big scatterplots called Manhattan plots https://en.wikipedia.org/wiki/Manhattan_plot and we have to use all this highly specialized software to visualize at different scales (for a sense of what this looks like: https://my.locuszoom.org/gwas/236887/). Excited to try this out
Hey! This sounds like a really interesting use case. If you run into any issues or need help with the visualization, please don't hesitate to post an issue on the repo. We can also think about adding an example demo of a manhattan plot to help too!
If you’re working in R with ggplot2, you could also consider the `ggrastr` package, specifically, `ggrastr::geom_point_rast`
Have you tried ManimGL?
https://github.com/3b1b/manim/releases
Super awesome, and you can make it into an MCP for Cursor.
I always thought it was interesting that my modern CPU takes ages to plot 100,000 or so points in R or Python (ggplot2, seaborn, plotnine, etc) and yet somehow my 486DX 50Mhz could pump out all those pixels to play Doom interactively and smoothly.
How does it compare to HoloViz? [1]
I followed one of their online workshops, and it feels really powerful, although it is a bit confusing which part of it does what (it's basically 6 or 7 projects put together under an umbrella)
[1] https://holoviz.org/
> powered by WGPU, a cross-platform graphics API that targets Vulkan (Linux), Metal (Mac), and DX12 (Windows).
The fact that they are using WGPU, which appears to be a Python-native implementation of WebGPU, suggests an interesting possible extended use case. As a few other comments suggest, if one knows that the data is available on a machine in a cluster rather than on the local machine of a user, it might make sense to start up a server, expose a port, and pass along the data over http to be rendered in a browser. That would make it shareable across the lab. The limit would be the data bandwidth over http (e.g. for the 3 million point case), but it seems like for simpler cases it would be very useful.
That would lead to an interesting exercise of defining a protocol for transferring plot points over http in such a way that they could be handed over to the browser WebGPU interface efficiently. Perhaps an even more efficient representation is possible with some pre-processing on the server side?
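A first cut of the server side could be as simple as shipping raw float32 buffers that the browser can copy straight into a GPUBuffer without per-point parsing (a sketch; the endpoint and framing are made up):

```python
import numpy as np
from http.server import BaseHTTPRequestHandler, HTTPServer

points = np.random.rand(3_000_000, 2).astype(np.float32)  # placeholder data

class PointsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Raw little-endian float32 pairs: the browser can hand these bytes
        # to WebGPU directly, with no JSON parsing or per-point copying
        payload = np.ascontiguousarray(points).tobytes()
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

HTTPServer(("", 8000), PointsHandler).serve_forever()
```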
> the data is available on a machine in a cluster rather than on the local machine of a user
jupyter-rfb lets you do remote rendering for this: render to a remote frame buffer and send a jpeg byte stream over. We and a number of our scientific users use it like this. https://fastplotlib.org/ver/dev/user_guide/faq.html#what-fra...
> defining a protocol for transferring plot points
This sounds more like GSP, which Cyrille Rossant (who's made some posts here) works on, it has a slightly different kind of use case.
What is GSP in this context? Searching "Python GSP" brings up the Generalized Sequence Pattern (GSP) algorithm [1] and Graph Signal Processing [2], neither of which seems to be a protocol. I also found "Generic Signaling Protocol" and "Global Sequence Protocol", which also don't seem relevant. Forgive me if GSP is some well-known thing which I am just not familiar with.
1. https://github.com/jacksonpradolima/gsp-py
2. https://pygsp.readthedocs.io/en/stable/
Graphics Server Protocol
Forgive me for doing this, but I used an LLM to find that. They’re exceptionally useful for disambiguation tasks like this. Knowing what an acronym refers to is very useful for next token prediction, so they’re quite good at it. It’s usually trivial to figure out if they’re hallucinating with a search engine.
[1] https://news.ycombinator.com/item?id=43335769
I don't think it's ready yet and I think it might be private at the moment, Cyrille can comment more on it.
But if I understand correctly it's a protocol for serializing graphical objects, pretty neat idea.
What you describe sounds a bit like Graphistry:
https://pygraphistry.readthedocs.io/en/latest/performance.ht...
WGPU is a Rust thing more than a Python thing.
Fair, I was looking at the wgpu-py [1] page but only skimmed it. It does indeed look like a wrapper over wgpu-native [2] which is written in Rust.
1. https://github.com/pygfx/wgpu-py
2. https://github.com/gfx-rs/wgpu-native
Do you have any numbers for the rough number of datapoints that can be handled? I'm curious if this enables plotting many millions of datapoints in a scatterplot for example.
Yes! The number of data points can range into the millions. Quite honestly, the quality of your GPU would be the limiting factor here. I will say, however, that for most use cases an integrated GPU is sufficient. For reference, we have plotted upwards of 3 million points on a mid-range integrated GPU from 2017.
I will work on adding somewhere in our docs some metrics for this kind of thing (I think it could be helpful for many).
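For anyone wanting to try, a few-million-point scatter is only a handful of lines; this sketch follows the style of the repo's README examples (API details may differ by version, so check the docs):

```python
import numpy as np
import fastplotlib as fpl

# 3 million random 2D points as float32 (a GPU-friendly dtype)
xy = np.random.rand(3_000_000, 2).astype(np.float32)

figure = fpl.Figure()
figure[0, 0].add_scatter(xy)
figure.show()

fpl.loop.run()  # start the render loop when running as a script
```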
>I will work on adding somewhere in our docs some metrics for this kind of thing (I think it could be helpful for many).
Certainly! A comparison of performance with specialized tools for large point clouds would be very interesting (like cloudcompare and potree).
I have watched recordings of your recent presentation and decided to finally give it a try last week. My goal is to create some interactive network visualizations - like letting you click/box-select nodes and edges to highlight subgraphs - which sounds possible with the callbacks and selectors.
Haven't had the time to get very far yet, but will gladly contribute an example once I figure something out. Some of the ideas I want to eventually get to are rendering shadertoys (interactively?) into an fpl subplot (haven't looked at the code at all, but might be doable), eventually running those interactively in the browser, and doing the network layout on the GPU with compute shaders (out of scope for fpl).
Hi! I've seen some of your work on wgpu-py! Definitely let us know if you need help or have ideas, if you're on the main branch we recently merged a PR that allows events to be bidirectional.
Sounds really compelling.
But it doesn't seem to answer how it works in Jupyter notebooks, or if it does at all. Is the GPU acceleration done "client-side" (JavaScript?) or "server-side" (in the kernel?) or is there an option for both?
Because I've used supposedly fast visualization libraries in Google Colab before, but instead of updating at 30 fps, it takes 2 seconds to update after a click, because after the new image is rendered it has to be transmitted via the Jupyter connector and network and that can turn out to be really slow.
Fastplotlib definitely works in Jupyterlab through jupyter-rfb https://github.com/vispy/jupyter_rfb
I believe the performance is pretty decent, especially if you run the kernel locally
Their docs also cover this as mentioned by @clewis7 below: https://www.fastplotlib.org/ver/dev/user_guide/faq.html#what...
Thanks Ivo!
Just to add on: colab is weird and not performant. This PR outlines our attempts to get jupyter-rfb working on colab: https://github.com/vispy/jupyter_rfb/pull/77
Thanks. Yeah I've been baffled as to why just interactive Matplotlib with a Colab kernel is so slow. The Colab CPU is fast (enough), the network is fast, I haven't been able to figure out where the bottleneck is either.
Is google colab slower than an equivalently powerful kernel running on a remote jupyter kernel? Are you running into network problems, or is it something specific to colab?
Looks very interesting. Does it allow plotting lines of varying thickness?
This looks super cool! Looking forward to trying it.
I think a killer feature of these gpu-plotting libraries would be if they could take torch/jax cuda arrays directly and not require a (slow) transfer over cpu.
Thanks! That is a great question and one that we've been battling with as well. As far as we know, this is not possible due to the way different contexts are set up on the GPU: https://github.com/pygfx/pygfx/issues/510
tinygrad which I haven't used seems torch-like and has a WGPU backend: https://github.com/tinygrad/tinygrad
Yeah, I remember looking into it myself as well, and not finding any easy path. A shame.... Maybe there's a hard way to do it though :)
I've been looking into this issue with Datoviz [1] following a user request. It turns out there may be a way to achieve it using Vulkan [2] (which Datoviz is based on) and CuPy's UnownedMemory [3]. I wrote a simple proof of concept using only Vulkan and CuPy.
I'm now working on a way for users to wrap a Datoviz GPU buffer as a CuPy array that directly references the Datoviz-managed GPU memory. This should, in principle, enable efficient GPU-based array operations on GPU data without any transfers.
[1] https://datoviz.org/
[2] https://registry.khronos.org/vulkan/specs/latest/man/html/VK...
[3] https://docs.cupy.dev/en/latest/reference/generated/cupy.cud...
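The CuPy half of that proof of concept is roughly the following; obtaining `raw_ptr` from Vulkan external memory (and getting it into CUDA's address space) is the hard, platform-specific part and is elided here:

```python
import cupy as cp

def wrap_device_pointer(raw_ptr: int, nbytes: int, shape, dtype):
    """Wrap an externally-owned CUDA device pointer (e.g. Vulkan memory
    exported and imported into CUDA) as a CuPy array, without copying."""
    # owner=None: CuPy will not attempt to free this memory itself
    mem = cp.cuda.UnownedMemory(raw_ptr, nbytes, owner=None)
    memptr = cp.cuda.MemoryPointer(mem, offset=0)
    return cp.ndarray(shape, dtype=dtype, memptr=memptr)
```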
This looks cool, thanks! Makes me wonder if there's any way to do that with WGPU, since WGPU interfaces with Vulkan. Probably not easy even if possible, I'm guessing.
WGPU has security protections since it's designed for the browser so I'm guessing it's impossible.
Indeed, it doesn't seem to be possible at the moment, see e.g. https://github.com/gfx-rs/wgpu/issues/4067
Wow. So are you saying that you can have some array on the GPU that you set up with python via CuPy, then you call to the web browser and give it the pointer address for that GPU array, and the browser through WASM/WebGPU can access that same array? That sounds like a huge browser security hole.
Yeah, the security issue is why I'm pretty sure you can't do it on WGPU, but Vulkan and CuPy can run fully locally, so they don't have the same security concern.
Exactly, this is the sort of thing you can more easily do on desktop than in a web browser.
Would it be possible to leverage the python array api standard? Or is that more suited for just computations?
Really nice post introducing your library.
When would you reach for a different library instead of fastplotlib?
How does this deal with really large datasets? Are you doing any type of downsampling?
How does this work with pandas? I didn't see it as a requirement in setup.py
Does this work in Jupyter notebooks? What about marimo?
Thanks!
> When would you reach for a different library instead of fastplotlib?
Use the best tool for your usecase; we're focused on GPU accelerated interactive visualization. Our use cases broadly are developing ML algorithms, user-end ML Ops tools, and looking at live data coming off of scientific instruments.
> How does this deal with really large datasets? Are you doing any type of downsampling?
Depends on your hardware, see https://fastplotlib.org/ver/dev/user_guide/faq.html#do-i-nee...
> How does this work with pandas? I didn't see it as a requirement in setup.py
If you pass in numpy-like types that use the buffer protocol it should work; we also want to support direct dataframe input in the future: https://github.com/fastplotlib/fastplotlib/issues/395
There are more low-level priorities in the meantime.
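So for now, getting a dataframe in means converting columns to numpy yourself, e.g. (a usage sketch; the column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10_000), "y": np.random.rand(10_000)})

# numpy arrays expose the buffer protocol, so this is what you'd hand
# to the plotting calls today
xy = np.column_stack([df["x"].to_numpy(),
                      df["y"].to_numpy()]).astype(np.float32)
```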
> Does this work in Jupyter notebooks? What about marimo?
Jupyter yes, via jupyter-rfb, see our repo: https://github.com/fastplotlib/fastplotlib?tab=readme-ov-fil...
Looking forward to checking out your library, thanks for sharing it with the world.
I’ve been using kst-plot for live streaming data from instruments and interactive plots. It’s fast and I haven’t found any limit on the amount of data it can plot. Development has basically stopped - the product is done, feature complete, and works perfectly! It is used by European and Canadian space agencies. It might be interesting for you to see how they solved or approached some of the same problems you have solved or will eventually tackle!
That would be preposterous if it wasn't so hilariously false:
> These days, having a GPU is practically a prerequisite to doing science, and visualization is no exception.
It becomes really funny when they go on to this, as if it was a big deal:
> Depicted below is an example of plotting 3 million points
Anybody who has ever used C or Fortran knows that a modern CPU can easily churn through "3 million points" at more than 30 frames per second, using just one thread. It's not a particularly impressive feat; three million points is the size of a mid-resolution picture, and you can zoom in and out of those trivially in real-time using a CPU (and you could do that 20 years ago, as well). Maybe the stated slowness of fastplotlib comes from the unholy mix of rust and python?
Now, besides this rant, I think that fastplotlib is fantastic and, as an (unwilling) user of Python for data science, it's a godsend. It's just that the hype of that website sits wrong with me. All the demos show things that could be done much more easily and just as fast when I was a teenager. The big feat, and a really big one at that, is that you can access this sort of performance from python. I love it, in a way, because it makes my life easier now; but it feels like a self-inflicted problem was solved in a very roundabout way.
>> Depicted below is an example of plotting 3 million points
> Anybody who has ever used C or Fortran knows that a modern CPU can easily churn through "3 million points" at more than 30 frames per second, using just one thread. It's not a particularly impressive feat; three million points is the size of a mid-resolution picture, and you can zoom in and out of those trivially in real-time using a CPU (and you could do that 20 years ago, as well). Maybe the stated slowness of fastplotlib comes from the unholy mix of rust and python?
That's a misrepresentation though: it's 3 million points across sine waves, something like 1000 sine waves with 3000 points in each. If you look at the zoomed-in image, the sine waves are spaced out significantly, so if you represented this as an image it would be at least a factor of 10 larger. Even that is likely a significant underestimate, since you also need to connect the points within each sine wave.
The comparison case would be to take a vector graphic (e.g. SVG) with 1000 sine wave lines, open it in a viewer (written in C or Fortran if you want), and try zooming in and out quickly.
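For concreteness, the dataset being discussed is something like this (shapes inferred from the description above):

```python
import numpy as np

n_lines, n_points = 1000, 3000  # ~3 million vertices total
x = np.linspace(0, 2 * np.pi, n_points, dtype=np.float32)

# One sine wave per row, each offset vertically so the lines stay spaced apart
lines = np.stack([np.sin((i % 10 + 1) * x) + 3 * i for i in range(n_lines)])
```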
Thanks, and the purpose was to show what's possible on modest hardware that most people have. We have created gigabytes of graphics that live on the gpu for more complex use cases and they remain performant, but you need a gaming gpu.
But why do you want to fit the whole dataset in memory? If the dataset is stored in a tiled and multi-scaled representation you need to only grab the part of it that is needed to fit your screen (which is a constant, small amount of data, even if the dataset is arbitrarily large).
If you insist on fitting the entire thing in memory, it may be better to do so in plain RAM, which nowadays is humongous even in "modest" systems.
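The tiled approach amounts to fetching only the tiles under the viewport, e.g. (a generic sketch, not tied to any particular library):

```python
import math

def visible_tiles(viewport, tile_size, zoom_level):
    """Return the (col, row) tile indices intersecting the viewport at a
    given zoom level -- the only data a tiled viewer needs to fetch."""
    x0, y0, x1, y1 = viewport   # world coordinates
    scale = 2 ** zoom_level     # higher level = coarser, larger tiles
    step = tile_size * scale
    return [(c, r)
            for c in range(math.floor(x0 / step), math.ceil(x1 / step))
            for r in range(math.floor(y0 / step), math.ceil(y1 / step))]
```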
Nice, I'd be interested to know which method it uses for drawing lines (which is hard [0]).
[0] https://mattdesl.svbtle.com/drawing-lines-is-hard
Almar made blog posts about the line shader he wrote!
https://almarklein.org/triangletricks.html
https://almarklein.org/line_rendering.html
A big shader refactor was done in this PR: https://github.com/pygfx/pygfx/pull/628
Thank you!
I know 3D is in the roadmap. Once the basic functionality is in place, it would be great to also consider integrating molecular visualization or at least provide enough fast primitives to simplify the integration of molecular visualization tools with this library.
We are definitely looking forward to adding more 3D graphics in the future, and this sounds really cool. Would you mind posting an issue on the repo? I think this is something we would want to have on the roadmap or at least an open issue to plan out how we could do this. Thanks!
I’m often working with a windows desktop and a remote Linux box on which I have my data & code. I’d like to plot “locally” on my desktop workstation from the remote host. This usually either means using X11 (slow) or some sort of web-based library like plotly. Does fastplotlib offer any easy solution here?
This is exactly why we use jupyter-rfb, I often have large datasets on a remote cluster computer and we perform remote rendering.
see: https://fastplotlib.org/ver/dev/user_guide/faq.html#what-fra...
I'm in the same boat as the person you replied to, but have zero experience with remote plotting other than doing static plots in a remote session in the interactive window provided by VS Code's python extension. Would this also work there, or would I have to start using jupyter notebooks?
Non-jupyter notebook implementations have their quirks; eventually we hope to make a more universal jupyter-rfb kind of library, perhaps using anywidget. Anywidget is awesome: https://github.com/manzt/anywidget
People have used fastplotlib and jupyter-rfb in vscode, but it can be troublesome and we don't currently have the resources to figure out exactly why.
Alright, thanks. I don't particularly like notebooks, but this might be a reason to give them another go.
Sometimes I wish these plotting libraries were more portable beyond Python only. I was looking for something similar for Ruby just a while ago but the install instructions seemed out of date and unsupported on Windows.
Any sufficiently advanced plotting library with an api that can be called externally becomes indistinguishable from a GUI toolkit: https://www.gnu.org/software/guile/docs/guile-tut/tutorial.h...
Not sure if that is the right tutorial, but many years ago in the guile 1.x days I wrote a local visualizer for the data from a particle physics accelerator entirely in Guile and Gnuplot. It was very MVC and used guile as the controller and Gnuplot as the viewer.
Was it stupid? Yes. Did it work better than all the other tools I had at the time? Also yes.
I do not know ruby, but sometimes that's an opportunity to try and make one that others will also find useful :)
Very cool to see imgui empowering so many different things.
We love imgui! Big thanks to the imgui devs, and Pascal Thomet who maintains the python bindings for imgui-bundle, and https://github.com/panxinmiao who made an Imgui Renderer for wgpu-py!
Imgui is awesome! Thanks for mentioning imgui-bundle—I hadn’t heard of it before, but it looks great! [1]
[1] https://github.com/pthom/imgui_bundle
Looks very interesting for interactive visualization. I like the animation interface. Also love imgui, glad to see it here. I wish I had better plotting tools for publication quality images (though, honestly I'm pretty happy with matplotlib).
Thanks! Yup, our focus is not publication figures; matplotlib and seaborn cover that space pretty well.
Very interesting and promising package.
I especially like that there is a PyQt interface which might provide an alternative to another great package: pyqtgraph[0].
[0] https://github.com/pyqtgraph/pyqtgraph
Thanks! I used pyqtgraph for many years and love what can be done with it. We started off wanting to build something like it, but based on WGPU and not bound to Qt.
Thank you for your interest! We have taken a lot of inspiration from pyqtgraph and really like their library.
Is it possible to put the interactive plots on your website? Or is this a Jupyter-notebook-only tool?
See here: https://www.fastplotlib.org/ver/dev/user_guide/faq.html#what...
We are hoping for pyodide integration soon, which would allow fastplotlib to be run strictly in the browser!
Thanks. That will be very cool.
In the browser it's Jupyter-only for now; you can use Voila to make a server-based application using Jupyter: https://github.com/voila-dashboards/voila
As Caitlin pointed out below, pyodide is a future goal.
This is very nice. But I'm thinking more along the lines of: can I embed a single interactive widget in a blog post?
Not today, it requires wgpu-py to support running on WASM / pyodide, which it doesn't yet (unfortunately)
One of the big bottlenecks of plotting libraries is simply the time it takes to import the library. I’ve seen matplotlib being slow to import, and in Julia they even have a “time to first plot” metric. I’d be curious to see how this library compares.
I think one nice thing that we have tried to do is limit super heavy dependencies and also separate optional dependencies to streamline things.
The quickest install would be `pip install fastplotlib`. This would be if you were interested in just having the barebones (no imgui or notebook) for desktop viz using something like glfw.
We can think about adding some kind of import-time metrics to our docs.
Almar did some work on speeding up imports a year ago: https://github.com/fastplotlib/fastplotlib/pull/431
but we haven't benchmarked it yet
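In the meantime, a quick way to get a rough number yourself (stdlib only; run it in a fresh interpreter so nothing is already cached in sys.modules):

```python
import time

t0 = time.perf_counter()
import fastplotlib  # cold import; nothing cached yet in this process
t1 = time.perf_counter()

print(f"import fastplotlib took {t1 - t0:.2f} s")
```

`python -X importtime -c "import fastplotlib"` gives a per-module breakdown if you want to see where the time goes.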
https://archive.md/G3wj6
Yeah, many browsers have WebGPU turned off by default, so you're stuck with WASM (WASM SIMD if you're lucky).
Hopefully both are implemented.
This library builds upon pygfx and wgpu-py. Unfortunately, the latter doesn't support running on WASM, pyscript or pyodide yet, but there's an issue about it:
https://github.com/pygfx/wgpu-py/issues/407
PRs welcome though :-)
GPU all the things! GPU-accelerated Tableau would be incredible.
I’m not making neuroscience visualizations. I’m working with line graphs, rather, and would like to animate based on ~10000 points. I’m looking to convert these visuals to video for YouTube, in HD at 60fps using the HEVC/H.265 codec. I took a quick look at the documentation to see if this is possible and didn’t see anything. Is this sort of rendering supported, or will it be?
I previously tried this with matplotlib and it took 20-30 minutes to make a single rendering, because matplotlib only uses a single CPU core and doesn’t support GPU acceleration. I also tried Manim, but I couldn’t get an actual video file out of it, and OpenGL seems to be a bit complicated to work with (I went and worked on other things, though I should ask around about the video file output). Anyway, I’m excited about the prospect of a GPU-accelerated dataviz tool that utilizes Vulkan, and I hope this library can cover my use case.
Rendering frames and saving them to disk can be done with rendercanvas but we haven't exposed this in fastplotlib yet: https://github.com/pygfx/rendercanvas/issues/49
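In the meantime, a hypothetical sketch of what that pipeline could look like: grab each frame as an RGB array and feed it to imageio's ffmpeg writer. `render_frame` here is a made-up stand-in for whatever offscreen frame-grabbing hook ends up being exposed.

```python
import imageio.v2 as imageio
import numpy as np

def render_frame(i: int) -> np.ndarray:
    # Hypothetical placeholder: swap in a real offscreen render call.
    # It must return an (H, W, 3) uint8 RGB array for frame i.
    return np.zeros((1080, 1920, 3), dtype=np.uint8)

# Requires the imageio-ffmpeg plugin; codec="libx265" requests HEVC/H.265
writer = imageio.get_writer("out.mp4", fps=60, codec="libx265")
for i in range(600):  # 10 seconds at 60 fps
    writer.append_data(render_frame(i))
writer.close()
```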
Another tool that requires precise control over memory layout, bandwidth, performance… using Python.
... using Python... itself leveraging NumPy, C, the GPU...
> `sine_wave.colors[::3] = "red"`
I never knew I needed this until now
We offer a lot of ways to slice colors, set cmaps and cmap transforms, they are really useful in neuroscience:
https://fastplotlib.org/ver/dev/_gallery/line/line_colorslic...
https://fastplotlib.org/ver/dev/_gallery/line/line_cmap_more...
https://fastplotlib.org/ver/dev/_gallery/line/line_cmap.html...
And with collections if you want to go crazy: https://fastplotlib.org/ver/dev/_gallery/line_collection/lin...
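A rough sketch of what this looks like in code, assuming the current `Figure`/`add_line` API (see the gallery links above for the canonical examples):

```python
import numpy as np
import fastplotlib as fpl

xs = np.linspace(0, 4 * np.pi, 1000)
data = np.column_stack([xs, np.sin(xs)])

fig = fpl.Figure()
sine_wave = fig[0, 0].add_line(data)

sine_wave.cmap = "jet"           # map a colormap along the line
sine_wave.colors[::3] = "red"    # then slice: every 3rd point red
sine_wave.colors[:100] = "cyan"  # or a contiguous run

fig.show()  # in a notebook this displays the canvas widget
```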
Very cool effort. That said (and it's probably because of the kind of work that I do), I have almost never found the four challenges to be any kind of a problem for me. Although I do think there is some kind of contradiction there. Plotting (exploratory data analysis ("EDA"), really) is all about distilling key insights and finding features hidden in data. But you have to have some kind of intuition about where the needle in the haystack is. IME, throwing up a ton of plots and being able to scrub around in them never seems to provide much insight. It's also rarely speed-limited: usually the feedback loop is "make a plot, go away and think about it for an hour, decide what plot I need to make next, repeat". If there is too much data on the screen it defeats the point of EDA a little bit.
For me, matplotlib still reigns supreme. Rather than a fancy new visualization framework, I'd love for matplotlib to just be improved (admittedly, fastplotlib covers a different set of needs than what matplotlib does... but the author named it what they named it, so they have invited comparison. ;-) ).
Two things for me at least that would go a long way:
1) Better 3D plotting. It sucks, it's slow, it's basically unusable, although I do like how it looks most of the time. I mainly use PyVista now but it sure would be nice to have the power of a PyVista in a matplotlib subplot with a style consistent with the rest of matplotlib.
2) Some kind of WYSIWYG editor that will let you propagate changes back into your plot easily. It's faster and easier to adjust your plot layout visually rather than in code. I'd love to be able to make a plot, open up a WYSIWYG editor, lay things out a bit, and have those changes propagate back to code so that I can save it for all time.
(If these features already exist I'll be ecstatic ;-) )
I have to agree with your point about EDA. The library is neat, but even the example of covariance matrix animation is a bit contrived.
Every pixel has a covariance with every other pixel, so sliding through the rows of the covariance matrix generates as many faces on the right as there are pixels in a photograph of a face. However, the pixels that strongly co-vary will produce very similar right-side "face" pictures. To get a sense of how many different behaviours there are, one would look for eigenvectors of this covariance matrix. And then 10 or so static eigenvectors of the covariance matrix (eigenfaces [1]) would be much more informative than the thousands of animated faces displayed in the example.
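For concreteness, a minimal numpy sketch of the eigenfaces computation, with random data standing in for real face images:

```python
import numpy as np

faces = np.random.rand(400, 64 * 64)  # stand-in for 400 flattened 64x64 faces
X = faces - faces.mean(axis=0)        # center each pixel

# The SVD of the centered data gives the eigenvectors of the pixel
# covariance matrix without forming the full 4096x4096 matrix.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
eigenfaces = Vt[:10].reshape(10, 64, 64)  # top 10 eigenfaces, as images
```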
Sometimes a big interactive visualisation can be a sign of not having a concrete goal or not knowing how to properly summarise. After all, that's the purpose of a figure - to highlight insights, not to look for ways to display the entire dataset. And pictures that try to display the whole dataset end up shifting the job of exploratory analysis into a visual space, leaving it for somebody else.
Though of course there are exceptions.
[1]: https://en.wikipedia.org/wiki/Eigenface
Hi, one of the other devs here. As the poster below pointed out, what you're missing is that in this case we know that an eigendecomposition or PCA will be useful. However, if you're working on matrix decomposition algorithms like us, or if you're trying to design new forms of summary matrices because a covariance matrix isn't informative for your type of data, then these types of visualizations are useful. We broadly work on designing new forms of matrix decomposition algorithms, so it's very useful to look at the matrices and then try to determine what types of decompositions we want to do.
I've also worked on designing new matrix decompositions, and I've never found the need for anything but `imshow`...
Ok, different libraries have different use cases; the type of data we work with absolutely necessitates dynamic visualization. You wouldn't view a video with imshow, would you?
Every time I've needed to scrub through something in time like that, dumping a ton of frames to disk using imshow has been good enough. Usually, the limiting factor is how quickly I can generate a single frame.
It's hard for me to imagine what you're doing that necessitates such fancy tools, but I'm definitely interested to learn! My failure of imagination is just that.
The example from the article with the subtitle "Large-scale calcium imaging dataset with corresponding behavior and down-stream analysis" is a good example. We have brain imaging video that is acquired simultaneously with behavioral video data. It is absolutely essential to view the raw video at 30-60Hz.
Aren't you missing the entire point of exploratory data analysis? Eigenfaces are an example of what you can come up with as the end product of your data exploration, after you've tried many ways of looking at the data and determined that eigenfaces are useful.
Your whole third paragraph seems to be criticizing the core purpose of exploratory data analysis as though one should always be able to skip directly to the next phase of having a standardized representation. When entering a new problem domain, somebody needs to actually look at the data in a somewhat raw form. Using the strengths of the human vision system to get a rough idea of what the typical data looks like and the frequency and character of outliers isn't dumping the job of exploratory data analysis onto the reader, it's how the job actually gets done in the first place.
> Using the strengths of the human vision system to get a rough idea of what the typical data looks like and the frequency and character of outliers isn't dumping the job of exploratory data analysis onto the reader, it's how the job actually gets done in the first place.
Yup, this is a good summary of the intent. We also have to remember that the eigenfaces dataset is a very clean/toy data example. Real datasets never look this good, and just going straight to an eigendecomp or PCA isn't informative without first taking a look at things. Often you may want to do something other than an eigendecomp or PCA: get an idea of your data first and then think about what to do to it.
Edit: the point of that example was to show that visually we can judge what the covariance matrix is producing in the "image space". Sometimes a covariance matrix isn't even the right type of statistic to compute from your data and interactively looking at your data in different ways can help.
As a whole, of course you have a point - big visualisations when done properly should help with data exploration. However, from my experience they rarely (but not never) do. I think it's specific to the type of data you work with and the visualisation you employ. Let me give an example.
Imagine we have some big data - like an omics dataset about chromatin modification differences between smokers and non-smokers. Genomes are large, so one way to visualise might be to do a manhattan plot (mentioned here in another comment). Let's (hypothetically) say the pattern in the data is that chromatin in the vicinity of genes related to membrane functioning has more open chromatin marks in smokers compared to non-smokers. A manhattan plot will not tell us that. And in order to be able to detect that in our visualisation, we had to already know what we were looking for in the first place.
My point in this example is the following: in order to detect that we would have to know what to visualise first (i.e. visualise the genes related to membrane function separately from the rest). But then when we are looking for these kinds of associations - the visualisation becomes unnecessary. We can capture the comparison of interest with a single number (i.e. average difference between smokers vs non-smokers within this group of genes). And then we can test all kinds of associations by running a script with a for-loop in order to check all possible groups of genes we care about and return a number for each. It's much faster than visualisation. And then after this type of EDA is done, the picture would be produced as a result, displaying the effect and highlighting the insights.
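To make the for-loop idea concrete, a hypothetical pandas sketch (the `gene` and `smoker_diff` columns are made up for illustration):

```python
import pandas as pd

def score_gene_sets(df: pd.DataFrame, gene_sets: dict) -> pd.Series:
    # df: one row per gene, with a precomputed smoker-vs-non-smoker
    # difference in a "smoker_diff" column (hypothetical names).
    # gene_sets: maps a group name to a list of gene IDs.
    scores = {}
    for name, genes in gene_sets.items():
        # one number per group of genes, instead of one picture
        scores[name] = df.loc[df["gene"].isin(genes), "smoker_diff"].mean()
    return pd.Series(scores).sort_values()
```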
I understand your point about visualisation being an indistinguishable part of EDA. But the example I provided above is much closer to my lived experience.
Yeah, I agree with the general sentiment of what you're saying.
Re: wtallis, I think my original complaint about EDA per se is indeed off the mark.
Certainly creating a 20x20 grid of live-updating GPU plots and visualizations is a form of EDA, but it seems to suggest a complete lack of intuition about the problem you're solving. Like you're just going spelunking in a data set to see what you can find... and that's all you've got; no hypothesis, no nothing. I think if you're able to form even the meagerest of hypotheses, you should be able to eliminate most of these visualizations and focus on something much, much simpler.
I guess this tool purports to eliminate some of this, but there is also a degree of time-wasting involved in setting up all these visualizations. If you do more thinking up front, you can zero in on a smaller and more targeted subset of experiments. Simpler EDA tools may suffice. If you can prove your point with a single line or scatter plot (or number?), that's really the best case scenario.
Eigendecomposition of the covariance matrix, essentially PCA, is probably the first non-trivial step in the analysis of any dataset. The idea in the comment above seems to be that it's more useful to combine some basic knowledge of statistics with simpler visualisation techniques, rather than to quickly generate thousands of shallower plots. Being able to generate thousands of plots is useful, of course, but I would agree that promoting good data-analysis culture is more beneficial.
> Eigendecomposition of the covariance matrix, essentially PCA, is probably the first non-trivial step in the analysis of any dataset
For a sufficiently narrow definition of "dataset", perhaps. I don't think it's the obvious step one when you want to start understanding a time series dataset, for example. (A Fourier transform would be a more likely step two, after step one of actually looking at some of your data.)
I agree, but: the technique of “singular spectrum analysis” is pretty much PCA applied to a covariance matrix resulting from time-lagging the original time series. (https://en.wikipedia.org/wiki/Singular_spectrum_analysis)
So this is not unheard of for time series analysis.
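A minimal numpy sketch of that connection: embed the series into lagged windows and eigendecompose their covariance.

```python
import numpy as np

x = np.sin(np.linspace(0, 20 * np.pi, 1000)) + 0.3 * np.random.randn(1000)
L = 50                                              # window (lag) length
X = np.lib.stride_tricks.sliding_window_view(x, L)  # (K, L) trajectory matrix
K = X.shape[0]

C = X.T @ X / K                       # (L, L) lag-covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)  # ascending eigenvalues
# The eigenvectors belonging to the largest eigenvalues (last columns)
# capture the dominant oscillatory components of the series.
```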
Exactly that's a good example!
For me, one of the most annoying things in my workflow is waiting for the software to catch up. If I'm making a plot, there are a lot of little tweaks I want to make to visually extract the maximum amount of information from a dataset. For example, if I'm making a histogram, I may want to adjust the number of bins, change to log scale, set min/max to remove outliers, and change the plot size on the page. For the sake of the argument, let's say I'm working with a set of 8 slices of the dataset, so I need to regenerate 8 plots every time I make a tweak. My workflow is: (1) code the initial plots with default settings, (2) run numpy to process the data, (3) run matplotlib to display the data, (4) look at the results, (5) make tweaks to the code, and (6) circle back to step 2. In that cycle, "wait for matplotlib to finish generating the plots" can often be one of the longest parts, and critically it's the vast majority of the cumulative time where I'm waiting rather than actively doing something. Drawing plots should be near instantaneous; there's an entire industry devoted to drawing complicated graphics in 16ms or less, so I shouldn't need to wait >100ms for a single 2D grid with some dots and lines on it.
Matplotlib is okay, but there's definitely room for improvement, so why not go for that improvement?
I think this varies a lot depending on what you're doing.
I agree 100% that matplotlib is really slow and should be made to run as fast as humanly possible. I would add a (3) to my list above: optimize matplotlib!
OTOH, at least for what I'm doing, the code that runs to generate the data that gets plotted dominates the runtime 99% of the time.
For me, adjusting plots is usually the time waster. Hence point (2) above. I'd love to be able to make the tweaks using a WYSIWYG editor and have my plotting script dynamically updated. The bins, the log scale, the font, the dpi, etc, etc.
I think with your 8 slices examples above: my (2) and (3) would cover your bases. In your view, is the rest of matplotlib really so bad that it needs to be burnt to the ground for progress to be made?
Yeah, I'd love it if mpl could be optimized. I do think that it has a lot of weird design decisions that could justify burning it down and starting from scratch (e.g. weird mix of stateful and stateless api), but I've already learned most of its common quirks so I selfishly don't care anymore, and my only significant complaint is that I want it to be faster :)
edit: regarding runtime, I'm sure this varies a lot based on usecase, but for my usual usecase I store a mostly-processed dataset, so the additional processing before drawing the data is usually minimal.
I'd be curious to hear more about your EDA workflow.
What I want for EDA is a tool that lets me quickly toggle between common views of the dataset. I run through the same analysis over and over again; I don't want to type the same commands repeatedly. I have my own heuristics for which views I want, and I want a platform that lets me write functions that express those heuristics. I want to build the intelligence into the tool instead of having to remember a bunch of commands to type on each dataframe.
For manipulating the plot, I want a low-code UI that lets me point and click the operations I want to use to transform the dataframe. The low-code UI should also emit Python code to do the same operations (so you aren't tied to a low-code system; you just use it as a faster way to generate code than typing).
I have built the start of this for my open source datatable UX called Buckaroo. But it's for tables, not for plotting. The approach could be adapted to plotting. Happy to collaborate.
I, at least, usually prefer to do EDA plotting by writing and editing code. This is a lot more flexible. It's relatively rare to need interactivity other than zooming and panning.
The differing approaches can probably be seen in some API choices, although the fastplotlib API is a lot more ergonomic than many others. Having to index the figure or prefix plot calls with add_ are minor things, and probably preferable for application development, but for fast-iteration EDA they will start to irritate fast. The "mlab" API of matplotlib violates all sorts of software development principles, but it's very convenient for exploratory use.
Matplotlib's performance, especially with interaction and animation, and clunky interaction APIs are definite pain points, and a faster and better interaction supporting library for EDA would be very welcome. Something like a mlab-type wrapper would probably be easy to implement for fastplotlib.
And to bikeshed a bit, I don't love the default black background. It's against usual conventions, difficult for publication, and a bit harder to read if you're used to white.
Writing and editing code is a lot more flexible, but it gets repetitive, and I have written the same stuff so many times. It's all ad hoc: it fixes the problem at the time, then it gets thrown away with the notebook only to be written again soon.
As an example, I frequently want to run analytics on a dataframe. More complex summary stats. So you write a couple of functions, and have two for loops, iterating over columns and functions. This works for a bit. It's easy to add functions to the list. Then a function throws an error, and you're trying to figure out where you are in two nested for loops.
Or, especially for pandas, you want separate functions to depend on the same expensive pre-calc. You could pass along the existing dict of computed measures so you can reuse that expensive calculation... Now you have to worry about the ordering of functions.
So you could put all of your measures into one big function, but that isn't reusable. So you write your big function over and over.
I built a small dag library that handles this, and lets you specify that your analysis requires keys and provides keys, then the DAG of functions is ordered for you.
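A toy sketch of the requires/provides idea (not my actual library, just the shape of it; graphlib is stdlib in Python 3.9+):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def analysis(requires=(), provides=()):
    """Tag a function with the keys it needs and the keys it produces."""
    def wrap(fn):
        fn.requires, fn.provides = tuple(requires), tuple(provides)
        return fn
    return wrap

@analysis(provides=("summary",))
def expensive_precalc(state):
    state["summary"] = state["df"].describe()  # assumes a pandas df in state

@analysis(requires=("summary",), provides=("outliers",))
def find_outliers(state):
    state["outliers"] = state["summary"].loc["std"]

def run_all(funcs, state):
    produced_by = {key: f for f in funcs for key in f.provides}
    # map each function to the set of functions producing its inputs
    graph = {f: {produced_by[key] for key in f.requires} for f in funcs}
    for f in TopologicalSorter(graph).static_order():
        f(state)

# run_all([find_outliers, expensive_precalc], {"df": df}) runs the
# precalc first even though it is listed second.
```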
How do other people approach these issues?
I work with R and not python, so some things might not apply, but this:
> [...] it fixes the problem at the time, then it gets thrown away with the notebook only to be written again soon.
Is one of the reasons I stopped using notebooks.
One solution to your problem might be to create a simple executable script that, when called on the file of your dataset in a shell, would produce the visualisation you need. If it's an interactive visualisation then I would create a library or otherwise a re-usable piece of code that can be sourced. It takes some time but ends up saving more time in the end.
If you have custom-made things you have to check on your data tables, then likely no library will solve your problem without you doing some additional work on top.
And for these:
> Or, especially for pandas, you want to separate functions to depend on the same expensive pre-calc. [...] Now you have to worry about the ordering of functions.
I save expensive outputs to intermediate files, and manage dependencies with a very simple build-system called redo [1][2].
[1]: http://www.goredo.cypherpunks.su
[2]: http://karolis.koncevicius.lt/posts/using_redo_to_manage_r_d...
Thanks. I see how redo works.
For larger datasets, real scripts are a better idea. I expect my stuff to work with datasets up to about 1 GB; caching is easy to layer on and would speed up work for larger datasets, but my code assumes the data fits in memory. It would be easier to add caching than to make sure I don't load an entire dataset into memory. (I don't serialize the entire dataframe to the browser, though.)
Usually I write scripts that use a function memoization cache (to disk) for expensive operations. Recently I've also sometimes used Marimo, which has great support for modules (no reloading hacks), can memoize to disk, and has deterministic state.
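For the disk memoization, something like joblib's Memory (assuming joblib is available) covers the common case:

```python
from joblib import Memory
import pandas as pd

memory = Memory("./.cache", verbose=0)

@memory.cache
def expensive_precalc(path):
    # Re-runs only when the arguments (or the function body) change;
    # otherwise the result is loaded back from ./.cache on disk.
    return pd.read_csv(path).describe()
```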
I agree with you sfpotter, very interesting. Looks in some ways similar to PyQtGraph regarding real time plotting.
I agree with you regarding matplotlib, although I find a lot of faults/frustrations in using it. Both your points, on 3D plotting and a WYSIWYG editor, would be extremely nice, and as far as I know nothing exists in Python ticking those boxes. For 3D I typically default to Matlab, as I've found it to be the most responsive/easy to use. I've not found anything directly like a WYSIWYG editor. Stata is the closest but I deplore it; R to some extent has it, but if I'm generating multiple plots it doesn't always work out.
I'm surprised by what you said about "EDA". I find the opposite: a shotgun approach, exploring a vast number of plots with various stratifications, gives me better insight. I've explored plotting across multiple languages (R, Python, Julia, Stata) and not found one that meets all my needs.
The biggest issue I often face is that I have 1000 plots I want to generate that are all from separate data groups and could all be plotted in parallel, but most plotting libraries have holds/issues with distribution/parallelization. The closest I've found is that I'll often build up a plot in Python using a Jupyter notebook. Once I'm done I'll create a function taking all the needed data and saving a plot out, then either manually or with the help of LLMs convert it to Julia, which I've found to be much faster at loading and processing large amounts of data. Then I can loop it using Julia's "Distributed" package. It's less than ideal (threaded access would be great, rather than having to distribute the data), but I've yet to find something that works. I'd love a simple 2D EDA plotting library that has basic plots like lines, histograms (1D/2D), scatter plots, etc., has basic colorings and alpha values, and is able to handle large amounts (thousands to millions of points) of static data, plotting and saving to disk in parallel. I've debated writing my own library but I have other priorities currently; maybe once I finish my PhD.
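For reference, the kind of batch loop I mean, sketched in Python with matplotlib's Agg backend and a process pool (processes rather than threads, since matplotlib isn't thread-safe):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; set before importing pyplot
import matplotlib.pyplot as plt
import numpy as np
from multiprocessing import Pool

def make_plot(i):
    rng = np.random.default_rng(i)  # stand-in for loading data group i
    fig, ax = plt.subplots()
    ax.hist(rng.normal(size=10_000), bins=50)
    fig.savefig(f"plot_{i:04d}.png", dpi=150)
    plt.close(fig)  # free the figure's memory in the worker

if __name__ == "__main__":
    with Pool() as pool:
        pool.map(make_plot, range(1000))
```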
Interested to hear what your PhD is in.
I agree on refining matplotlib; we all need it to be better at resource handling and lower in memory use, as it often gets boggy quickly.
For point (2), have you tried the perspective-viewer library? You can make edits in the UI and then use the "debug view" to copy and paste the new configuration back into your code.
https://perspective.finos.org/
My hot take is that 3D plotting feels bad because 3D plots are bad. You can usually find some alternative way of representing the data.
I work on solving 3D problems: numerical methods for PDEs in R^3, computational geometry, computational mechanics, graphics, etc. Being able to make nice 3D plots is super important for this. I agree it's not always necessary, and when a 2D plot suffices, that's the way to go, but that doesn't obviate my need for 3D plots.
3D plots might be neat if there was some widespread way of displaying them. Unfortunately we can only make 2D projections of 3D plots on our computer screens and pieces of paper.
Maybe VR will change that at some point. :shrug:
This is the correct take. There are almost always better ways to plot three dimensional data than trying to project 3D geometry to 2D.
Shameless plug: I'm actively working on a similar project, Datoviz [1], a C/C++ library with thin Python bindings (ctypes). It supports both 2D and 3D but is currently less mature and feature-complete than fastplotlib. It is also lower level (high-level capabilities will soon be provided by VisPy 2.0 which will be built on top of Datoviz, among other possible backends).
My focus is primarily on raw performance, visual quality, and scalability for large datasets—millions, tens of millions of points, or even more.
[1] https://datoviz.org/
Cool to see you on here Cyrille, I've been following your work (and Nicolas's) for a long time. Thanks for all the cool stuff you've been doing!
I have always admired your datoviz library from afar and check the vispy2/vispy2-sandbox libraries on GitHub every few months to check up on it. When do you think 'soon' is?? Really looking forward to it!
Thanks! The code is currently managed by Nicolas Rougier in a GitHub repository that will be made public next week. This repository hosts the "graphics server protocol" (GSP), an intermediate layer between Datoviz and the future high-level plotting API. For the latter, we’ll need community feedback to shape an API philosophy that aligns with VisPy users' needs—let's aim to publish a write-up this month.
Implementing the API on top of GSP should be relatively straightforward, as the core graphics-related mechanisms are handled by GSP/Datoviz. We've created a Slack channel for discussions—contact me privately if you'd like to join.
"Fast" is a bold claim, given the complete lack of benchmarks and the fact that it's written entirely in Python...
I'm certain the host-side heavy lifting is done by numpy, which is a Python wrapper around Fortran and C. The visualization heavy lifting is done by pygfx/wgpu-py, and wgpu-py binds to native code through a C API. I think wgpu-py compiles to WASM to run in the browser. More and more packages are taking this route.
[1] https://github.com/pygfx/pygfx
[2] https://github.com/pygfx/wgpu-py
In fastplotlib, at the end of the day, everything is wgpu under the hood; and as the other poster correctly pointed out, numpy wraps Fortran and C.
Seems like a nice library, but I have a hard time seeing myself using it over plotly. The plotly express API is just so simple and easy. For example, here's the docs for the histogram plot: https://plotly.com/python/histograms/
This code gives you a fully interactive, and performant, histogram plot:
```python
import plotly.express as px

df = px.data.tips()
fig = px.histogram(df, x="total_bill")
fig.show()
```
Different use cases :) Plotly doesn't give the performance and interactive tools required for many neuroscience visualizations. We also focus more on the primitive graphics and not (at least not yet) on the more complex "composite" graphics built from primitives, like histograms.
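For the curious, a rough sketch of what a do-it-yourself histogram looks like today, assuming the current `add_line` API: bin with numpy, then draw the outline with a line primitive.

```python
import numpy as np
import fastplotlib as fpl

values = np.random.normal(size=100_000)
counts, edges = np.histogram(values, bins=50)

# Turn bin edges/counts into a step-shaped outline for the line primitive
xs = np.repeat(edges, 2)[1:-1]
ys = np.repeat(counts, 2).astype(np.float32)

fig = fpl.Figure()
fig[0, 0].add_line(np.column_stack([xs, ys]))
fig.show()
```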
[flagged]
Please stop.
[flagged]
"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."
https://news.ycombinator.com/newsguidelines.html
I appreciate the warning, and if it's not by Claude I apologize, but I do think we should be allowed to express scepticism when things posted are just AI slop (and if we have to fear getting banned or what-have-you as a consequence, I genuinely think that's worse for HN long-term than the alternative).
Don't worry, we wouldn't ban anyone for this. I agree with you that it's a grey area and will take time to work out.
If the skepticism is based on nothing but vibes, such commentary is functionally equivalent to something the site guidelines already ask you to avoid.
I dunno why you'd say this, neither of us are fans of LLMs and most of this was written before LLMs were a thing :)
Maybe Claude was trained on your code. You should take it as a compliment.
asked Claude, it said it didn’t do it :)