Highlights of the Ibis 1.3 release
Ibis 1.3 was just released, after 8 months of development work, with 104 new commits from 16 unique contributors. What is new? In this blog post, we will discuss some of the important features of this new version!
First, if you are new to the Ibis framework, you can check out this blog post I wrote last year, which gives some introductory information about it.
Some highlighted features of this new version are:
- Addition of a PySpark backend
- Improvement of geospatial support
- Addition of JSON, JSONB and UUID data types
- Initial support for Python 3.8 added and support for Python 3.5 dropped
- Added new backends and geospatial methods to the documentation
- Renamed the mapd backend to omniscidb
This blog post is divided into different sections:
- OmniSciDB
- PostgreSQL
- PySpark
- Geospatial support
- Python versions support
import ibis
import pandas as pd
OmniSciDB
The mapd backend is now named omniscidb!
An important feature of omniscidb is that you can now define whether the connection uses IPC (Inter-Process Communication), and you can also specify the GPU device ID you want to use (if you have an NVIDIA card supported by cudf).
IPC is used to provide shared data support between processes. OmniSciDB uses Apache Arrow to provide IPC support.
con_omni = ibis.omniscidb.connect(
    host='localhost', 
    port='6274',
    user='admin',
    password='HyperInteractive',
    database='ibis_testing',
    ipc=False,
    gpu_device=None
)
con_omni.list_tables()
You can also pass ipc or gpu_device directly to the execute method:
t = con_omni.table('functional_alltypes')
expr = t[['id', 'bool_col']].head(5)
df = expr.execute(ipc=False, gpu_device=None)
df
As you can imagine, df is a pandas.DataFrame:
type(df)
But if you use gpu_device, the result will be a cudf DataFrame :)
Note: when ipc=True is used, the code needs to be executed on the same machine where the database is running.
Note: when gpu_device is used, 1) it uses IPC (see the note above), and 2) it needs an NVIDIA card supported by cudf.
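For example, on the machine where the database is running, with an NVIDIA GPU available, a call like the following sketch (the device ID 0 here is just an assumption) would return a cudf DataFrame:
# hypothetical GPU execution: this must run on the database host,
# with an NVIDIA card supported by cudf (device ID 0 assumed)
df_gpu = expr.execute(ipc=True, gpu_device=0)
type(df_gpu)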
Another interesting feature is that omniscidb now also supports shapefiles as input and geopandas DataFrames as output!
Check out the Geospatial support section below to see more details!
The new version also adds translations for more window operations for the omniscidb backend, such as DenseRank, RowNumber, MinRank, Count, and PercentRank/CumeDist.
For more information about window operations, check the Window functions documentation section.
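As a quick illustration, here is a sketch of a ranking expression over the functional_alltypes table from above (assuming the rank and dense_rank expression methods, which mirror the MinRank and DenseRank operations):
# rank rows by id; a sketch of the newly translated window operations
w = ibis.window(order_by=t.id)
rank_expr = t[
    t.id,
    t.id.rank().over(w).name('rank'),
    t.id.dense_rank().over(w).name('dense_rank'),
]
rank_expr.head(5).execute()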
PostgreSQL
Some of the highlighted features of the PostgreSQL backend are the newly included data types: JSON, JSONB and UUID.
from uuid import uuid4 
uuid_value = ibis.literal(uuid4(), type='uuid')
uuid_value == ibis.literal(uuid4(), type='uuid')
import json
json_value = ibis.literal(json.dumps({"id": 1}), type='json')
json_value
jsonb_value = ibis.literal(json.dumps({"id": 1}).encode('utf8'), type='jsonb')
jsonb_value
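You can inspect the Ibis data type carried by each of these literals (a minimal check):
# each literal carries the data type we requested
print(uuid_value.type())   # uuid
print(json_value.type())   # json
print(jsonb_value.type())  # jsonb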
Another important new feature of the PostgreSQL backend is support for new geospatial operations, such as:
- GeometryType
- GeometryN
- IsValid
- LineLocatePoint
- LineMerge
- LineSubstring
- OrderingEquals
- Union
It also now supports two more geospatial data types: MULTIPOINT and MULTILINESTRING.
Check out the Geospatial support section below to see some usage examples of geospatial operations!
PySpark
This new version also includes support for a new backend: PySpark!
Let's do the first steps with this new backend starting with a Spark session creation.
import os
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.types as pt
from pathlib import Path
# spark session and pyspark connection
spark_session = SparkSession.builder.getOrCreate()
con_pyspark = ibis.pyspark.connect(session=spark_session)
We can use Spark or pandas to read a CSV file. In this example, we will use pandas.
data_directory = Path(
    os.path.join(
        os.path.dirname(ibis.__path__[0]),
        'ci',
        'ibis-testing-data'
    )
)
pd_df_alltypes = pd.read_csv(data_directory / 'functional_alltypes.csv')
pd_df_alltypes.info()
Now we can create a Spark DataFrame and a temporary view from it. We should also enforce the desired type for each column.
def pyspark_cast(df, col_types):
    # cast each given column to the corresponding Spark SQL type
    for col, dtype in col_types.items():
        df = df.withColumn(col, df[col].cast(dtype))
    return df
ps_df_alltypes = spark_session.createDataFrame(pd_df_alltypes)
ps_df_alltypes = pyspark_cast(
    ps_df_alltypes, {
        'index': 'integer',
        'Unnamed: 0': 'integer',
        'id': 'integer',
        'bool_col': 'boolean',
        'tinyint_col': 'byte',
        'smallint_col': 'short',
        'int_col': 'integer',
        'bigint_col': 'long',
        'float_col': 'float',
        'double_col': 'double',
        'date_string_col': 'string',
        'string_col': 'string',
        'timestamp_col': 'timestamp',
        'year': 'integer',
        'month': 'integer'
    }
)
# use ``SparkSession`` to create a table
ps_df_alltypes.createOrReplaceTempView('functional_alltypes')
con_pyspark.list_tables()
Let's check that all columns were created with the desired data types:
t = con_pyspark.table('functional_alltypes')
t
Unlike a SQL backend, where compile returns a SQL statement, the PySpark compile method returns a PySpark DataFrame:
expr = t.head()
expr_comp = expr.compile()
type(expr_comp)
expr_comp
To convert the compiled expression to a pandas DataFrame, you can use the toPandas method. The result should be the same as that returned by the execute method.
assert all(expr.execute() == expr_comp.toPandas())
expr.execute()
To finish this section, we can play a little bit with some aggregation operations.
expr = t
expr = expr.groupby('string_col').aggregate(
    int_col_mean=t.int_col.mean(),
    int_col_sum=t.int_col.sum(),
    int_col_count=t.int_col.count(),
)
expr.execute()
Check out the PySpark Ibis backend API documentation and the tutorials for more details.
Geospatial support
Currently, ibis.omniscidb and ibis.postgres are the only Ibis backends that support geospatial features.
In this section we will check some geospatial features using the PostgreSQL backend.
con_psql = ibis.postgres.connect(
    host='localhost',
    port=5432,
    user='postgres',
    password='postgres',
    database='ibis_testing'
)
con_psql.list_tables()
Two important features are that it supports shapely objects as input and returns geopandas DataFrames as output!
So, let's use shapely to create a simple point and polygon.
import shapely
shp_point = shapely.geometry.Point((20, 10))
shp_point
shp_polygon_1 = shapely.geometry.Polygon([(20, 10), (40, 30), (40, 20), (20, 10)])
shp_polygon_1
Now, let's create an Ibis table expression to manipulate a "geo" table:
t_geo = con_psql.table('geo')
df_geo = t_geo.execute()
df_geo
And the type of df_geo is ... a geopandas GeoDataFrame!
type(df_geo)
So you can take advantage of GeoPandas features too!
df_geo.set_geometry('geo_multipolygon').head(1).plot();
Now, let's check which geo_multipolygon values contain the point we just created.
t_geo[
    t_geo.geo_multipolygon, 
    t_geo['geo_multipolygon'].contains(shp_point).name('contains_point_1')
].execute()
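We can also sketch a couple of the new geospatial operations mentioned in the PostgreSQL section; the snake_case method names below (geometry_type, is_valid) are assumptions that mirror the operation names:
# a sketch of the new operations; the method names are assumed
# to mirror the GeometryType and IsValid operations
t_geo[
    t_geo.geo_multipolygon.geometry_type().name('geom_type'),
    t_geo.geo_multipolygon.is_valid().name('valid'),
].execute()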
Final words
Do you want to play more with the Ibis framework?
You can install it from PyPI:
python -m pip install --upgrade ibis-framework==1.3.0
Or from conda-forge:
conda install ibis-framework=1.3.0 -c conda-forge
Check out some interesting tutorials to help you get started with Ibis: https://docs.ibis-project.org/tutorial.html. If you are coming from the SQL world, maybe the Ibis for SQL Programmers documentation section will be helpful. Have fun!