
Why choose Python for data analysis


Written by Karl

May 14, 2019 | 4 min read


Python is a general-purpose programming language with a huge ecosystem of existing libraries, which makes it well suited to building complex scientific and numeric applications. Because so many of those libraries are designed for data analysis and visualization, Python is an excellent choice for custom big data solutions as well as artificial intelligence (AI) and natural language processing tasks.

Last but not least, the existing data visualization libraries and APIs allow the data to be visualized and presented appealingly and effectively.

Data analysis

Data often arrives in huge Excel sheets with a large number of rows and columns. Analyzing the contents of those cells and making decisions based on them usually requires a lot of computational power and is very time-consuming. Python addresses this with libraries such as NumPy and pandas, which run vectorized operations over whole columns in compiled code instead of looping cell by cell in Python. Where that is still not enough, the work can be parallelized on a GPU with GPU-aware libraries, which can cut the time-consuming part by a factor of ten or more for suitable workloads: a general-purpose processor may have 4 or 8 cores, while an NVIDIA GPU may have thousands of cores and a pipeline that supports parallel processing on thousands of threads.
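As a minimal sketch of the column-wise style that NumPy and pandas encourage, the example below computes a derived column in one vectorized expression and compares it with an equivalent Python loop. The table, its column names and its values are made up for illustration, and the code runs on the CPU only.

```python
import numpy as np
import pandas as pd

# Hypothetical sales table; column names and values are illustrative only.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "units": rng.integers(1, 100, size=100_000),
    "unit_price": rng.uniform(1.0, 50.0, size=100_000),
})

# Vectorized: a single expression operates on whole columns at once.
df["revenue"] = df["units"] * df["unit_price"]

# Equivalent row-by-row loop, shown only for comparison.
loop_revenue = [u * p for u, p in zip(df["units"], df["unit_price"])]

total_revenue = df["revenue"].sum()
```

On tables of this size the vectorized line is typically much faster than the loop, although the exact difference depends on the machine and the data.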

Getting data

In an ideal world, data is presented in an easily readable format and can be transferred to the data analysis program with no significant effort. More often than not, though, the data is not readily available and has to be extracted from user behavior or scraped from the web. Python libraries such as Beautiful Soup and Scrapy make this task a breeze compared with the alternatives, saving developers hours or even days of development.
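To give a rough idea of what scraping looks like, here is a small sketch that uses Beautiful Soup to pull rows out of an HTML table. The HTML snippet, the table id and the field names are invented for the example; a real scraper would first download the page and would need to respect the site's terms of use.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = """
<html><body>
  <table id="prices">
    <tr><th>Product</th><th>Price</th></tr>
    <tr><td>Widget</td><td>9.99</td></tr>
    <tr><td>Gadget</td><td>24.50</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# Skip the header row, then read the two cells of every data row.
for tr in soup.find("table", id="prices").find_all("tr")[1:]:
    name, price = (td.get_text(strip=True) for td in tr.find_all("td"))
    rows.append({"product": name, "price": float(price)})
```

The resulting list of dictionaries can be handed straight to an analysis library, for example with `pandas.DataFrame(rows)`.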

Data visualization

After the data has been analyzed, it needs to be presented in an easily understandable format. That is where graphical representation, or visualization, of the data comes into play. Plain numbers stretching from one end of the screen to the other soon turn any vision blurry and make it difficult to derive meaningful insights from the data set. That is why it is necessary to present the data in the form of figures such as pie charts, diagrams and bar graphs. Libraries such as Matplotlib and Seaborn make the hard things easy and the seemingly impossible possible.
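As an illustration, the sketch below draws a simple bar graph with Matplotlib and saves it to a file. The months, the sales figures and the output filename are placeholders chosen for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

# Hypothetical monthly totals; the values are illustrative only.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 95, 143, 170]

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(months, sales, color="steelblue")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
ax.set_title("Monthly sales")
fig.tight_layout()
fig.savefig("monthly_sales.png")
```

In a notebook the `matplotlib.use("Agg")` line can be dropped and the figure is shown inline instead of being written to disk.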

Plotly’s online platform is used mainly for data visualization and is easily accessible from a Python notebook.

Altair is a declarative statistical visualization library for Python based on Vega-Lite. Because it is declarative, the user only needs to specify how data columns map to encoding channels such as x-axis, y-axis and color; the rest of the plotting details are handled automatically.

Geoplotlib is a popular tool for map creation and plotting geographical data.

Dealing with missing data is cumbersome. The completeness of a dataset can be gauged quickly with Missingno, rather than painstakingly searching through a table. The user can filter and sort data based on completion or spot correlations with a heat map.
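The same idea can be sketched numerically with plain pandas. This is a stand-in for illustration and does not use Missingno's own API; the table and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical table with gaps; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [34, np.nan, 29, 41, np.nan],
    "city": ["Tallinn", "Tartu", None, "Narva", "Haapsalu"],
    "income": [2100.0, 1800.0, np.nan, np.nan, np.nan],
})

# Fraction of non-missing values per column, from least to most complete.
completeness = df.notna().mean().sort_values()

# Rows ordered by how many of their fields are filled in.
rows_by_completeness = df.assign(filled=df.notna().sum(axis=1)).sort_values("filled")
```

Here `completeness` gives one number per column, which is the kind of summary that Missingno presents visually.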

Machine learning

Machine learning involves heavy mathematical work such as calculus, probability and matrix operations over thousands of rows and columns, which makes it computationally very demanding. All this can be made simple and cost-efficient with the help of scikit-learn, a machine learning library for Python. IBM research found Python to be the most popular language for machine learning.
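To show how little code a basic model needs, here is a minimal scikit-learn sketch that trains a classifier on the small iris dataset bundled with the library. The choice of dataset, model and split size is an assumption made for the example, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small example dataset shipped with scikit-learn (150 rows, 4 features).
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows to check the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the features, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Share of correctly classified rows in the held-out set.
accuracy = model.score(X_test, y_test)
```

The matrix operations and optimization happen inside `fit`, so the calling code stays short even when the data grows.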

Image processing

What if the data is not in the form of text but in the form of images? Not to worry, Python has a solution for that as well. The open-source library OpenCV is dedicated to image processing and computer vision.

R vs Python?

While R has a bigger collection of statistical libraries and is great for specialized statistical work, Python takes the cake because it is better for building the actual analytics tools. R and Python go head to head if the goal is to find outliers in a dataset, but once a web service is needed so that other people can upload datasets and find outliers at a bigger scale, Python clearly triumphs. Python already has an extensive list of modules for building the necessary interface, where users can be easily managed and can quickly interact, cooperate and build on a variety of databases in a unified platform.
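For reference, the outlier search mentioned above can be sketched in a few lines of NumPy. The interquartile range rule used here is one common convention picked for the example, and the function name and sample readings are hypothetical.

```python
import numpy as np

def find_outliers_iqr(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    data = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return data[(data < lower) | (data > upper)]

# Hypothetical sensor readings with one obviously unusual value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2, 9.7]
outliers = find_outliers_iqr(readings)
```

A function like this is easy to wrap in a web endpoint later, which is where Python's general-purpose tooling becomes useful.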

Looking at the graph below, we can see that Python is catching up fast to R, if it has not surpassed it already.

Who uses Python in data science? 

All this information is great, but who actually uses Python in day-to-day data science operations? Popularity alone doesn’t mean that the big players are ready to play along. Here are just some examples of organizations using Python for their data science needs:

DARPA (Defense Advanced Research Projects Agency)

Bank of America

Facebook

  • Facebook turns to Python for its data analysis and multi-application support (Source: FastCompany)
  • “We have a lot of systems inside Facebook, or infrastructure that allows us to either use Python to talk to those systems or integrates with Python very easily or is written in Python.” - Facebook engineering manager Burc Arpat

UC Berkeley

  • Its fastest-growing class is Data Science 101, taught in Python (Source: Berkeley)
  • Python is used as a front-end to deep-learning the Universe (Source: Phys.org)

Netflix

  • The Python programming language is behind every film you stream on Netflix (Source: ZDnet)
  • Python is used through the full product life cycle, from security tools and recommendation algorithms to its proprietary content distribution network (CDN) (Source: Netflix)


    Interested in knowing more? Get in touch with our industry expert Karl Õkva at karl@thorgate.eu