Big Data Training: An Introduction to Flink Aggregate Functions

Author: Teacher Zhang | 2021-03-29 18:05

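Aggregate functions are among the most frequently used functions in Flink, and they are often needed when working with complex data types. Today's big data training post covers the basics of Flink aggregate functions.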

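Generally speaking, a user-defined aggregate function (UDAGG) aggregates a table (one or more rows, each with one or more attributes) into a scalar value.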

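The figure above shows an example of an aggregation. Suppose you have a table containing beverage data. The table consists of three columns (id, name, and price) and 5 rows. Imagine you need to find the highest price of all beverages in the table, that is, perform a max() aggregation. You would need to check each of the 5 rows, and the result would be a single numeric value.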

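A user-defined aggregate function is implemented by extending the AggregateFunction class. An AggregateFunction works as follows. First, it needs an accumulator, which is the data structure that holds the intermediate result of the aggregation. An empty accumulator is created by calling the createAccumulator() method of the AggregateFunction. Next, the accumulate() method of the function is called for each input row to update the accumulator. Once all rows have been processed, the getValue() method of the function is called to compute and return the final result.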

The following methods are mandatory for every AggregateFunction:

createAccumulator()

accumulate()

getValue()
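The lifecycle of the three mandatory methods can be sketched in plain Java for the max() example. This is a minimal illustration, not Flink code: the class names MaxPriceAccumulator, MaxPriceFunction, and MaxPriceDemo are hypothetical, the function deliberately does not extend Flink's AggregateFunction so that it runs without Flink on the classpath, and the five prices are made-up sample values.

```java
import java.util.Arrays;
import java.util.List;

/** Accumulator: holds the intermediate result (the largest price seen so far). */
class MaxPriceAccumulator {
    double max = Double.NEGATIVE_INFINITY;
    boolean hasValue = false;
}

/** Plain-Java stand-in that mirrors the three mandatory methods (hypothetical class). */
class MaxPriceFunction {
    /** Creates an empty accumulator. */
    public MaxPriceAccumulator createAccumulator() {
        return new MaxPriceAccumulator();
    }

    /** Called once per input row to update the accumulator. */
    public void accumulate(MaxPriceAccumulator acc, double price) {
        if (!acc.hasValue || price > acc.max) {
            acc.max = price;
            acc.hasValue = true;
        }
    }

    /** Computes the final result; returns null if no row was seen. */
    public Double getValue(MaxPriceAccumulator acc) {
        return acc.hasValue ? acc.max : null;
    }
}

class MaxPriceDemo {
    /** Plays the role of the runtime: create, accumulate per row, then read the value. */
    static Double aggregate(List<Double> prices) {
        MaxPriceFunction fn = new MaxPriceFunction();
        MaxPriceAccumulator acc = fn.createAccumulator();
        for (double price : prices) {
            fn.accumulate(acc, price);
        }
        return fn.getValue(acc);
    }

    public static void main(String[] args) {
        // Made-up prices for the five beverage rows.
        System.out.println(aggregate(Arrays.asList(6.0, 3.0, 5.0, 8.0, 4.0))); // prints 8.0
    }
}
```

In a real job the Flink runtime, not user code, drives these calls; the aggregate helper only imitates that order.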

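Flink's type extraction facilities can fail to identify complex data types, for example when they are not basic types or simple POJOs. For this reason, similar to ScalarFunction and TableFunction, AggregateFunction provides methods to specify the TypeInformation of the result type (through AggregateFunction#getResultType()) and of the accumulator type (through AggregateFunction#getAccumulatorType()).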

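Besides the methods above, there are a few contracted methods that can be implemented optionally. Some of them allow the system to execute queries more efficiently, while others are mandatory for certain use cases. For example, the merge() method is required if the aggregate function is to be applied in the context of a session group window (when a row is observed that "connects" two session windows, their accumulators need to be merged).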
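To make the session-window case concrete, here is a minimal plain-Java sketch of a merge() method for an average. The names AvgAccumulator, AvgFunction, and SessionMergeDemo are hypothetical, the class does not extend Flink's AggregateFunction, and the two arrays simply stand in for the rows of two sessions that end up being joined.

```java
import java.util.Arrays;

/** Accumulator for an average: running sum and row count. */
class AvgAccumulator {
    long sum = 0;
    long count = 0;
}

/** Plain-Java stand-in for an aggregate function with merge support (hypothetical class). */
class AvgFunction {
    public AvgAccumulator createAccumulator() {
        return new AvgAccumulator();
    }

    public void accumulate(AvgAccumulator acc, long value) {
        acc.sum += value;
        acc.count += 1;
    }

    /**
     * Folds the other accumulators into acc. acc may already hold results,
     * so it is updated in place rather than replaced or cleared.
     */
    public void merge(AvgAccumulator acc, Iterable<AvgAccumulator> others) {
        for (AvgAccumulator other : others) {
            acc.sum += other.sum;
            acc.count += other.count;
        }
    }

    /** Returns null when no row has been accumulated. */
    public Double getValue(AvgAccumulator acc) {
        return acc.count == 0 ? null : (double) acc.sum / acc.count;
    }
}

class SessionMergeDemo {
    /** Accumulates two sessions separately, then merges the second into the first. */
    static Double mergedAverage(long[] sessionA, long[] sessionB) {
        AvgFunction fn = new AvgFunction();
        AvgAccumulator a = fn.createAccumulator();
        for (long v : sessionA) {
            fn.accumulate(a, v);
        }
        AvgAccumulator b = fn.createAccumulator();
        for (long v : sessionB) {
            fn.accumulate(b, v);
        }
        fn.merge(a, Arrays.asList(b));
        return fn.getValue(a);
    }

    public static void main(String[] args) {
        System.out.println(mergedAverage(new long[] {2, 4}, new long[] {6})); // prints 4.0
    }
}
```

Keeping sum and count separately (instead of a running average) is what makes the merge exact: two partial averages cannot be combined without knowing how many rows each one covers.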

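All methods of an AggregateFunction must be declared public, must not be static, and must be named exactly as listed above. The methods createAccumulator, getValue, getResultType, and getAccumulatorType are defined in the AggregateFunction abstract class, while the others are contracted methods.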

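To define an aggregate function, you must extend the base class org.apache.flink.table.functions.AggregateFunction and implement one (or more) accumulate methods. The accumulate method can be overloaded with different parameter types and supports variable arguments.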
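As a rough illustration of overloading and variable arguments, the plain-Java sketch below declares three accumulate methods on one function. SumAccumulator, FlexibleSum, and OverloadDemo are hypothetical names, the class does not extend Flink's AggregateFunction, and the sketch relies only on ordinary Java overload resolution.

```java
/** Accumulator holding a running sum. */
class SumAccumulator {
    long sum = 0;
}

/** Plain-Java stand-in with several accumulate overloads (hypothetical class). */
class FlexibleSum {
    public SumAccumulator createAccumulator() {
        return new SumAccumulator();
    }

    /** Overload 1: a single numeric input. */
    public void accumulate(SumAccumulator acc, long value) {
        acc.sum += value;
    }

    /** Overload 2: a numeric string, parsed before it is added. */
    public void accumulate(SumAccumulator acc, String text) {
        acc.sum += Long.parseLong(text.trim());
    }

    /** Overload 3: variable arguments, so several columns can be passed at once. */
    public void accumulate(SumAccumulator acc, long... values) {
        for (long v : values) {
            acc.sum += v;
        }
    }

    public Long getValue(SumAccumulator acc) {
        return acc.sum;
    }
}

class OverloadDemo {
    static long total() {
        FlexibleSum fn = new FlexibleSum();
        SumAccumulator acc = fn.createAccumulator();
        fn.accumulate(acc, 5L);         // resolves to the (long) overload
        fn.accumulate(acc, "7");        // resolves to the (String) overload
        fn.accumulate(acc, 1L, 2L, 3L); // resolves to the varargs overload
        return fn.getValue(acc);
    }

    public static void main(String[] args) {
        System.out.println(total()); // prints 18
    }
}
```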

// Simplified outline of the base class. accumulate, retract, merge, and resetAccumulator are
// contracted methods: they are not declared in the base class, so they are shown here as
// commented signatures with [user defined inputs] as a placeholder.
import org.apache.flink.api.common.typeinfo.TypeInformation;

/**
 * Base class for aggregation functions.
 *
 * @param <T>   the type of the aggregation result
 * @param <ACC> the type of the aggregation accumulator. The accumulator is used to keep the
 *              aggregated values which are needed to compute an aggregation result.
 *              AggregateFunction represents its state using the accumulator, so the state of
 *              the AggregateFunction must be put into the accumulator.
 */
public abstract class AggregateFunction<T, ACC> extends UserDefinedFunction {

    /**
     * Creates and initializes the accumulator for this {@link AggregateFunction}.
     *
     * @return the accumulator with the initial value
     */
    public abstract ACC createAccumulator(); // MANDATORY

    /**
     * Processes the input values and updates the provided accumulator instance. The method
     * accumulate can be overloaded with different custom types and arguments. An
     * AggregateFunction requires at least one accumulate() method.
     *
     * @param accumulator           the accumulator which contains the current aggregated results
     * @param [user defined inputs] the input value (usually obtained from newly arrived data)
     */
    // public void accumulate(ACC accumulator, [user defined inputs]); // MANDATORY

    /**
     * Retracts the input values from the accumulator instance. The current design assumes the
     * inputs are values that have been previously accumulated. The method retract can be
     * overloaded with different custom types and arguments. This function must be implemented
     * for datastream bounded over aggregate.
     *
     * @param accumulator           the accumulator which contains the current aggregated results
     * @param [user defined inputs] the input value (usually obtained from newly arrived data)
     */
    // public void retract(ACC accumulator, [user defined inputs]); // OPTIONAL

    /**
     * Merges a group of accumulator instances into one accumulator instance. This function must
     * be implemented for datastream session window grouping aggregate and dataset grouping
     * aggregate.
     *
     * @param accumulator the accumulator which will keep the merged aggregate results. Note
     *                    that the accumulator may contain previously aggregated results, so
     *                    the user should not replace or clean this instance in the custom
     *                    merge method.
     * @param its         an {@link java.lang.Iterable} pointing to a group of accumulators
     *                    that will be merged
     */
    // public void merge(ACC accumulator, java.lang.Iterable<ACC> its); // OPTIONAL

    /**
     * Called every time an aggregation result should be materialized. The returned value can
     * be either an early and incomplete result (periodically emitted as data arrive) or the
     * final result of the aggregation.
     *
     * @param accumulator the accumulator which contains the current aggregated results
     * @return the aggregation result
     */
    public abstract T getValue(ACC accumulator); // MANDATORY

    /**
     * Resets the accumulator for this {@link AggregateFunction}. This function must be
     * implemented for dataset grouping aggregate.
     *
     * @param accumulator the accumulator which needs to be reset
     */
    // public void resetAccumulator(ACC accumulator); // OPTIONAL

    /**
     * Returns true if this AggregateFunction can only be applied in an OVER window.
     *
     * @return true if the AggregateFunction requires an OVER window, false otherwise
     */
    public boolean requiresOver() { // PRE-DEFINED
        return false;
    }

    /**
     * Returns the TypeInformation of the AggregateFunction's result.
     *
     * @return the TypeInformation of the AggregateFunction's result, or null if the result
     *         type should be automatically inferred
     */
    public TypeInformation<T> getResultType() { // PRE-DEFINED
        return null;
    }

    /**
     * Returns the TypeInformation of the AggregateFunction's accumulator.
     *
     * @return the TypeInformation of the AggregateFunction's accumulator, or null if the
     *         accumulator type should be automatically inferred
     */
    public TypeInformation<ACC> getAccumulatorType() { // PRE-DEFINED
        return null;
    }
}
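The optional retract and resetAccumulator methods from the outline above can be sketched in plain Java with a weighted average. This is only an illustration under stated assumptions: WeightedAvgAccum, WeightedAvgSketch, and WeightedAvgDemo are hypothetical names, the class does not extend Flink's AggregateFunction so that it runs standalone, and the input values are made up.

```java
/** Accumulator for a weighted average: weighted sum and total weight. */
class WeightedAvgAccum {
    long sum = 0;
    int count = 0;
}

/** Plain-Java stand-in showing accumulate, retract, and resetAccumulator (hypothetical class). */
class WeightedAvgSketch {
    public WeightedAvgAccum createAccumulator() {
        return new WeightedAvgAccum();
    }

    /** Adds one row (value with its weight) to the accumulator. */
    public void accumulate(WeightedAvgAccum acc, long value, int weight) {
        acc.sum += value * weight;
        acc.count += weight;
    }

    /** Undoes a previous accumulate call for the same value and weight. */
    public void retract(WeightedAvgAccum acc, long value, int weight) {
        acc.sum -= value * weight;
        acc.count -= weight;
    }

    /** Puts the accumulator back into its initial, empty state. */
    public void resetAccumulator(WeightedAvgAccum acc) {
        acc.sum = 0;
        acc.count = 0;
    }

    /** Returns the weighted average (integer division), or null if the accumulator is empty. */
    public Long getValue(WeightedAvgAccum acc) {
        return acc.count == 0 ? null : acc.sum / acc.count;
    }
}

class WeightedAvgDemo {
    public static void main(String[] args) {
        WeightedAvgSketch fn = new WeightedAvgSketch();
        WeightedAvgAccum acc = fn.createAccumulator();
        fn.accumulate(acc, 10, 2);
        fn.accumulate(acc, 40, 1);
        System.out.println(fn.getValue(acc)); // prints 20
        fn.retract(acc, 40, 1);
        System.out.println(fn.getValue(acc)); // prints 10
        fn.resetAccumulator(acc);
        System.out.println(fn.getValue(acc)); // prints null
    }
}
```

Retraction works here because the accumulator stores invertible quantities (a sum and a weight total). An aggregate such as max() would need to keep more state to support retract.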

That concludes this overview of the basics of Flink aggregate functions. Aggregate functions are used quite frequently in practice, so the fastest way to master them is to study how they work and practice writing them.

